Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Multi-armed bandit problems with learned rewards
(USC Thesis Other)
Multi-armed bandit problems with learned rewards
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Multi-Armed Bandit Problems with Learned Rewards Yang Cao A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (INDUSTRIAL AND SYSTEMS ENGINEERING) August 2019 Abstract Suppose there are K arms of a slot machine, with arm i having a deterministic value v i ;i = 1; 2;:::;K. The values v i are assumed to be independently generated from a common known distribution F , and are initially unknown to the player. In each game, the player chooses an arm to play, learns the arm's value and receives a reward equal to this value. The objective is to nd a policy that maximizes the expected sum of rewards in N games, where N is a random variable following a known distribution. We model the preceding problem as a stochastic dynamic programming problem, with the state being (n;x;k), where n is the game about to be played, x is the value of the best old arm, and k is the number of remaining new arms. We show that the optimal policy is a threshold policy, meaning that there are threshold values c(n;k) such that the optimal action in state (n;x;k) is to play x if x c(n;k) and play a new arm if x <c(n;k). We also show that c(n;k) increases in k. When the horizon N has increasing failure rate (IFR), we show that c(n;k) decreases in n, and obtain their values using a one-stage lookahead procedure. In addition, we derive an expression for the maximum expected sum of rewards, and propose ecient simulation algorithms when that quantity is dicult to compute analytically. The preceding results are also shown when there are innitely many arms. When N is not IFR, N is unknown, or F has unknown parameters, we use the preceding results to construct high-performance and easy-to-implement heuristic policies, and evaluate their performances using simulation. We also consider a variation of the problem where the player is only allowed to play either a new arm or the arm used in the last game. Supposing that F is known and the distribution of N is known, we show that the optimal policy is a threshold policy and the optimal threshold values c(n;k) increase in k. In addition, when N is IFR, we show that the optimal policy is a threshold stopping policy. Acknowledgement It has been a fascinating journey throughout my years at USC. It would not be so splendid and delightful without the supports and encouragements from a lot of people, and I would like to take this opportunity to express my deepest and sincerest gratitude. First of all, I would like to thank my advisor Professor Sheldon M. Ross for your guidance, mentorship and help during the past ve years. From you, I learned not only the essential knowledge and methods for conducting independent research, but also how to think critically and creatively and communicate clearly. You are a lifelong learner and researcher, and your passion, integrity and rigorousness set a great example for me in both work and life. I also would like to thank my qualifying and dissertation committee members: Professor John Gunnar Carlsson, Professor Ketan Savla, Professor Phebe Vayanos and Professor Sze-Chuan Suen. I really appreciate your insightful comments and inspiring suggestions on my research. 
Then I would like to thank my talented and passionate labmates: Zhengyu Zhang, Babak Haji, Maher Nouiehed, Mohammadjavad Azizi, Siyuan Song, Ye Wang, Tianyu Hao, Xiangfei Meng, Yunan Zhou, Miju Ahn, Jiachuan Chen, Xiang Gao, Jie Jin, Junyi Liu, Santiago Carvajal, Ahmed Alzanki, Olivia Evanson, Shuotao Diao, Chou-Chun Wu and Maximilian Zellner. Your sincereness, brilliance and humor make our PhD journey full of happiness. Last but not least, I would like to thank my parents for your love, encouragements and supports. You have raised me to be an independent, courageous and reliable person, and I will use my whole life to repay you. Moreover, I feel so blessed to meet my girlfriend and best friend Wu, Xinyan. With your love and accompany, everyday is so wonderful, beautiful and meaningful. 1 Contents List of Tables 5 List of Figures 7 1 Introduction 8 1.1 The model of interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2 Applications of the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4 Structure of the paper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2 General properties of optimal policy 17 2.1 Finiteness of the expected sum of rewards . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 General structure of optimal policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3 When N has increasing failure rate 29 3.1 Optimal policy: the one-stage lookahead policy . . . . . . . . . . . . . . . . . . . . . 30 3.2 Expression for the maximum expected sum of rewards . . . . . . . . . . . . . . . . . 37 3.3 Ecient use of simulation for estimating V . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.1 The raw estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.2 Variance reduction using a control variable . . . . . . . . . . . . . . . . . . . 40 3.3.3 Variance reduction using post-stratication . . . . . . . . . . . . . . . . . . . 42 3.3.4 Performance evaluation of the estimators . . . . . . . . . . . . . . . . . . . . 43 2 4 When there are innitely many arms 45 4.1 General properties of optimal policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2 When N has increasing failure rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5 Heuristic policies when N follows a general distribution 52 5.1 When N follows a general distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.1.1 Static policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.1.2 Constant threshold policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.1.3 Constant threshold policies using a bounded number of new arms . . . . . . . 63 5.1.4 The one-stage lookahead policy . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.1.5 The mixture threshold policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2 When N follows a mixture of geometric distributions . . . . . . . . . . . . . . . . . . 71 5.3 Performance evaluation of the heuristics . . . . . . . . . . . . . . . . . . . . . . . . . 73 6 When N is unknown 76 6.1 When the reward distribution is unknown . . . . . . . . . . . . . . . . . . . . . . . . 76 6.2 When the reward distribution is known . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.3 Performance evaluation of the heuristics . . . . . . . . . . . . . . . . . . . . . . . . . 
78 7 When the reward distribution has unknown parameters 83 7.1 Bayesian policies for general reward distributions . . . . . . . . . . . . . . . . . . . . 84 7.1.1 When N is unknown: the Bayesian probabilistic policy . . . . . . . . . . . . . 84 7.1.2 When N is IFR: the Bayesian probabilistic one-stage lookahead policy . . . . 85 7.2 Fiducial policies for normal reward distributions . . . . . . . . . . . . . . . . . . . . 87 7.2.1 When N is unknown: the ducial probabilistic policy . . . . . . . . . . . . . 89 7.2.2 When N is IFR: the ducial probabilistic one-stage lookahead policy . . . . . 89 7.3 Performance evaluation of the heuristics . . . . . . . . . . . . . . . . . . . . . . . . . 90 7.3.1 The policies when N is unknown . . . . . . . . . . . . . . . . . . . . . . . . . 91 3 7.3.2 The policies when N is IFR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 7.4 When N is geometric: structure of optimal policy . . . . . . . . . . . . . . . . . . . . 95 7.5 When N is deterministic: structure of optimal policy . . . . . . . . . . . . . . . . . . 96 8 When the player is not allowed to play abandoned arms 98 8.1 When N follows a general distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 99 8.2 When N is IFR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 9 Conclusions and future directions 102 Bibliography 105 Appendix A Examples and proofs 113 A.1 An example of c(n;k) depending on k when N is not IFR . . . . . . . . . . . . . . . 113 A.2 Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 4 List of Tables 2.1 The initial states of the four processes . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2 The states of the four processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3 The initial states of the four processes . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4 The states of the four processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.1 The states of policies and 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2 The states of policies and 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3 The value and variance of the estimators when F is Uniform(0; 1) . . . . . . . . . . . 43 3.4 The value and variance of the estimators when F is Exponential(1) . . . . . . . . . . 44 5.1 Values of V (m) for m = 1; 2;:::; 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.2 Expected sum of rewards of the three heuristics when F is Uniform(0; 1) . . . . . . . 69 5.3 Expected sum of rewards of the three heuristics when F is Exponential(1) . . . . . . 69 5.4 Expected sum of rewards of the heuristics when N is IFR and F is Uniform(0; 1) . . 73 5 5.5 Expected sum of rewards of the heuristics when N is IFR and F is Exponential(1) . 73 5.6 Expected sum of rewards of the heuristics when N follows a mixture of geometric distributions and F is Uniform(0; 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.7 Expected sum of rewards of the heuristics when N follows a mixture of geometric distributions and F is Exponential(1) . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.1 Sum of rewards when F is Uniform(0; 1) . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.2 Sum of rewards when F is Exponential(1) . . . . . . . . . . . . . . . . . . . . . . . . 81 7.1 Expected sum of rewards when N is deterministic and = 0:5 . . . . . . . . . . . . . 
93 7.2 Expected sum of rewards when N is geometric and = 0:5 . . . . . . . . . . . . . . 93 7.3 Expected sum of rewards when N is deterministic and = 3 . . . . . . . . . . . . . . 94 7.4 Expected sum of rewards when N is geometric and = 3 . . . . . . . . . . . . . . . 94 6 List of Figures 5.1 Values of V (m) for dierent m, when N = 100 . . . . . . . . . . . . . . . . . . . . . 57 5.2 Best value of m for dierent deterministic horizon N . . . . . . . . . . . . . . . . . . 58 5.3 Values of V (c) for dierent c, when N = 100 and F is Uniform(0; 1) . . . . . . . . . 61 5.4 Values of V (c) for dierent c, when N = 100 and F is Exponential(1) . . . . . . . . 62 5.5 Best value of c for dierent deterministic horizon N . . . . . . . . . . . . . . . . . . 63 5.6 V (c;m) for dierent m and xed c = 0:9, when N = 100 and F is Uniform(0; 1) . . . 66 5.7 V (c;m) for dierent m and xed c = 3:24, when N = 100 and F is Exponential(1) . 67 5.8 Best value of m for dierent deterministic horizon N and corresponding c . . . . . . 68 6.1 Expected average reward per game when F is Uniform(0; 1) . . . . . . . . . . . . . . 79 6.2 Expected average reward per game when F is Exponential(1) . . . . . . . . . . . . . 81 7.1 Expected average reward per game when = 0:5 . . . . . . . . . . . . . . . . . . . . 91 7.2 Expected average reward per game when = 3 . . . . . . . . . . . . . . . . . . . . . 92 7 Chapter 1 Introduction 1.1 The model of interest Suppose there are K arms of a slot machine, with arm i having a non-negative value v i , i = 1; 2;:::;K. The values v i are assumed to be independently generated from a common known distribution F and initially unknown to the player. In each game, the player chooses an arm to play, learns the arm's value, and receives a reward equal to this value. The objective is to nd a policy that maximizes the expected sum of rewards in N games whereN, called the horizon of the problem, is a random variable with a known distribution. Note that we can assume that the player keeps on playing forever, while only the rewards of the rst N games count towards the objective, and we make this assumption throughout. An arm is called an old arm if it has already been played and called a new arm otherwise. The value of an old arm is known to the player, while the value of a new arm is unknown and can be considered as a random variable following distribution F . In each game, although the player is allowed to use any arm, the optimal policy should only choose between playing a new arm or playing the best old arm so far. We model the preceding as a stochastic dynamic programming 8 problem with the state being (n;x;k), where n is the game about to be played, x is the value of the best old arm and k is the number of remaining new arms. For an arbitrary policy , we let V (n;x;k) denote the expected sum of rewards of a process starting from state (n;x;k) and employing policy . We let denote the optimal policy, and V (n;x;k) =V (n;x;k) denote the maximum expected sum of rewards from state (n;x;k) onward. In addition, we let V old (n;x;k) denote the maximum expected sum of rewards from state (n;x;k) onward given that the best old arm is played in game n, and V new (n;x;k) denote the maximum expected sum of rewards from state (n;x;k) onward given that a new arm is played in game n. Throughout this paper, we assume thatE[N 2 ]<1. LetN n =NnjNn denote the additional horizon after game n, given that the horizon is at least n games. Let X be a random variable following distribution F and let =E[X]. 
We assume that is nite. The optimality equation is V (n;x;k) = maxfV old (n;x;k);V new (n;x;k)g where V old (n;x;k) =x +P (N n 1)V (n + 1;x;k) and V new (n;x;k) = 8 > > > < > > > : +P (N n 1)E[V (n + 1; maxfx;Xg;k 1)]; for k 1 0; for k = 0 The optimal action in state (n;x;k) is to play the best old arm x if V old (n;x;k) V new (n;x;k), and to play a new arm otherwise. For notational convenience, we letV =V (1; 0;K) denote the expected sum of rewards of a brand new process under policy , and V =V (1; 0;K) denote the maximum expected sum of rewards of a brand new process. We also use the notation X + = maxfX; 0g for simplicity. 9 A policy is said to be a stopping policy, if once an old arm is played it will also be played throughout the remaining games. The term stop refers to \stop exploring new arms and keep playing the best old arm throughout". A policy is said to have a threshold structure, if there are non-negative numbers c(n;k);n 1; 0 k K such that in state (n;x;k) the policy plays x if x c(n;k), and plays a new arm ifx<c(n;k). The valuesc(n;k) are called the threshold values of the policy. A policy having a threshold structure with threshold values c(n;k) is called the threshold policy fc(n;k);n 1; 0kKg. In addition, if the threshold valuesc(n;k) decrease inn, then once the policy plays an old arm x, the policy will play x throughout the remaining games. Consequently, such a policy is also a stopping policy, and is called a threshold stopping policy. We will show that the optimal policy always has a threshold structure, and that when the horizon N has increasing failure rate (IFR) (that is, N is such that P (N = njN n) increases in n), the optimal policy is a threshold stopping policy. These results hold in both the nitely-many-arm problem and the innitely-many-arm problem. We are also interested in high-performance and easy-to-implement heuristic policies, especially when the optimal policy is dicult to obtain. Examples of heuristics considered are static policies, which start by playing a predetermined number of new arms and then play the best old arm throughout the remaining games, and probabilistic policies, which in each game play a new arm according to an exploration probability. Finally, an arm is said to be abandoned, if it is not played in the next game after its initial play. We are also interested in problems where the player is not allowed to play abandoned arms. In these problems, the player can only choose between playing a new arm or playing the arm used in the last game. Similar to the original problem, we will show that the optimal policy always has a threshold structure, and when the horizon is IFR the optimal policy is a threshold stopping policy. 10 1.2 Applications of the model The multi-armed bandit model has broad applications in practice, for example in online adver- tisement recommendation, news article recommendation and website design optimization. In these applications, the arms represent the available choices of advertisements/news articles/website de- signs, and as each customer (or each batch of customers) arrives, we choose one of these arms to display to the customer and earn certain amount of reward. The assumption of constant rewards is reasonable in these applications, especially in the context of online services. Major websites may have thousands of visitors per second. 
With such kind of heavy trac, it is costly to update the database after every customer visit, and almost impossible to let the decision on a visitor depend on the data of all previous visitors. A commonly adopted approach is to update the database after each batch of visitors (for example a thousand visitors). In this case, visitors in the same batch should use the same arm (assuming the decisions do not depend on user information). Thus, by law of large numbers, the sum of rewards in each batch can be approximately considered as a constant. In summary, in the batch updating bandit model, a game represents a batch of customers, and for each batch we choose an arm (advertisement/news article/website design) to display based on the data of previous batches, and earn a constant reward corresponding to that arm. When the arms' reward distribution is known and the horizon of the problem is IFR, results in Chapter 3 can be applied to obtain the optimal policy. In other cases, the high-performance heuristic policies in Chapters 5, 6 and 7 can be applied. 11 1.3 Related literature There is a broad literature on the stochastic multi-armed bandit problem. The problem assumes that there are K arms of a slot machine, and the rewards of playing arm i are independently and identically distributed (i.i.d.) random variables following distributionF (j i ), where i are unknown parameters and are assumed to be independent to each other. In the Bayesian setting of the problem, which assumes independent priors on the parameters of the arms and a discount factor on the rewards, it has been shown that the Gittins index policy has the maximum expected sum of discounted rewards (see Gittins and Jones [1], Whittle [2], Weber [3] and Tsitsiklis [4]). When we do not have priors on the unknown parameters, the objective of the problem is usually to minimize the expected total regret, where the regret in a game is dened as the dierence between the expected reward of the best arm and the expected reward of the arm played. In 1985, Lai and Robbins [5] proved that for any policy the expected total regret in n games has an asymptotic lower bound in the order of O(logn), and proposed a policy whose expected total regret achieves the order of the lower bound. Our model diers from the classic stochastic bandit model in that we assume the reward of each armi is a constantv i rather than a random variable. The constantsv i are independently generated from a common distributionF , and are learned once the arm is played. The most related study on the bandit model with learned rewards is by David and Shimkin ([6] and [7]). When F is known, the authors proved a lower bound on the expected total regret in n games for any policy. When an upper bound on the tail distribution of F is known, and the horizon N is deterministic and known, the authors proposed a static policy whose expected total regret has the same order as the lower bound. The authors also proposed policies for the cases of unknown horizon, unknown reward distribution and when the player cannot play abandoned arms, and proved upper bounds 12 on the expected total regret of the proposed policies. Our work directly characterizes the structure of the optimal policy when F is known, and further gives the optimal policy when the horizon N is IFR. Regarding the stochastic multi-armed bandit problem, there are many well-known bandit policies achieving the logarithmic order lower bound of regret. 
Auer, Cesa-Bianchi and Fischer [8] proved that when the arms' reward distributions have support in [0; 1], the Upper Condence Bound (UCB) policy and the n -greedy policy achieve the logarithmic lower bound. Cowan, Honda and Katehakis [9] proved that for normal bandits with unknown means and variances, a version of the UCB policy achieves the logarithmic lower bound. Another commonly used bandit policy is the Thompson sampling policy, whose idea is to play each arm with the probability of that arm being the best. Thompson [10] initially proposed this idea in 1933, and Agrawal and Goyal [11] showed that the policy has a logarithmic order of regret in 2012. Thompson sampling was also shown to have a good empirical performance (see Chapelle and Li [12]). Finally, we mention the -greedy policy, which in each game plays a random arm with probability and plays the arm with the best estimate otherwise. Although the policy theoretically has linear regret, it usually performs well in practice, especially over a short horizon (see Vermorel and Mohri [13], and Kuleshov and Precup [14]). For empirical comparisons of the classic bandit policies, see Vermorel and Mohri [13], Chapelle and Li [12], and Kuleshov and Precup [14]. For more literature on the classic bandit problem, see Robbins [15], Agrawal [16], Katehakis and Robbins [17], Lai [18], Herschkorn, Pek oz and Ross [19], Capp e et al. [20], Russo et al. [21], Sutton and Barto [22], and Lattimore and Szepesv ari [23]. For more literature on Bayesian bandits and Gittins index, see Gittins, Glazebrook and Weber [24], Gittins [25], Gittins and Jones [26], Gittins and Glazebrook [27], Chen and Katehakis [28], Katehakis and Veinott [29], and Lattimore and Szepesv ari [23]. An important generalization of the classic multi-armed bandit problem is the contextual bandit 13 problem. The problem assumes that in each game a context vector x i is observed for each arm i, and the reward of arm i follows distribution F (j;x i ) where are unknown parameters. The contextual bandit problem is a good model for online personalized advertising, personalized news article recommendation and clinical trial problems. For example, in the personalized advertising problem, the context vector can be used to model the user and advertisement information. A commonly studied setting of the contextual bandit problem supposes that the expected payo is a linear function of the context vector. Li, Chu, Langford and Schapire [30] proposed an upper condence bound policy for this problem, called the LinUCB policy, and empirically evaluated the performance of the policy in a new article recommendation application. Chu, Li, Reyzin and Schapire [31] proved an upper bound on the regret of a modied version of the LinUCB algorithm, and also proved that if the context vector has d dimensions, then for any policy, the worst case expected total regret inn games (nd 2 ) has a lower bound of orderO( p dn). Agrawal and Goyal [32] proposed a Thompson sampling policy for this problem and proved a regret bound of the policy. Based on the idea of Thompson sampling, May, Korda, Lee and Leslie [33] proposed an Optimistic Bayesian Sampling policy for the general contextual bandit problem, and empirically showed that the policy has a competitive performance in the context-free Bernoulli bandit setting, the linear payo setting and the news article recommendation application. 
For more literature on the contextual bandit problem, see Abe, Biermann and Long [34], Langford and Zhang [35], Dani, Hayes and Kakade [36], Agarwal et al. [37], Slivkins [38], Tang et al. [39], Joseph et al. [40], and Dud k [41]. There are many other variations of the classic multi-armed bandit problem, for example the restless bandit problem (Whittle [42], Weber and Weiss [43], Liu and Zhao [44], Guha, Munagala and Shi [45], and Bertsimas and Ni~ no-Mora [46]), the mortal multi-armed bandit problem (Chakrabarti, Kumar, Radlinski and Upfal [47]), the dueling bandit problem (Yue and Joachims [48], Sui, Zoghi, Hofmann and Yue [49], Chen and Frazier [50], Komiyama, Honda, Kashima and Nakagawa [51], 14 Dud k et al. [52], Pek oz, Ross and Zhang [53]), the combinatorial multi-armed bandit problem (Chen, Wang and Yuan [54], Gai, Krishnamachari and Jain [55], Chen et al. [56]), and the adver- sarial bandit problem (Auer, Cesa-Bianchi, Freund and Schapire [57] and [58], Auer and Chiang [59]). For more literature on the general topic of the multi-armed bandit problems, see Lattimore and Szepesv ari [23], Slivkins [60], Bubeck and Cesa-Bianchi [61], Berry and Fristedt [62], and Mahajan and Teneketzis [63]. For more applications of bandit models, see Katehakis and Derman [64], Press [65], Li, Chu, Langford and Schapire [30], Villar, Bowden and Wason [66], and Schwartz, Bradlow and Fader [67]. 1.4 Structure of the paper In Chapter 2, we consider the case where the horizon N follows a known general distribution. We show that when E[N 2 ] <1, the maximum expected sum of rewards V (n;x;k) is nite for any state (n;x;k). In addition, we show that the optimal policy always has a threshold structure, and the optimal threshold values c(n;k) increase in k, the number of remaining new arms. In Chapter 3, we consider the case whereN is IFR. We show that in this case the optimal threshold values c(n;k) decrease in n, implying that the optimal policy is a threshold stopping policy. We obtain the optimal threshold values using optimal stopping theory. In addition, we derive an expression for the maximum expected sum of rewards, and present ecient simulation algorithms when that quantity is dicult to compute analytically. In Chapter 4, we show that the preceding results also hold when there are innitely many arms. In Chapter 5, we revisit the case of general horizon, and propose several heuristic policies. We also propose a heuristic policy when the distribution of N is a mixture of geometric distributions, a 15 special case of a decreasing failure rate distribution. The performances of the heuristic policies are evaluated using simulation. In Chapter 6, we consider the case where the distribution of N is unknown. We propose heuristic policies and evaluate their performances using simulation. In Chapter 7, we suppose that the reward distribution F is not completely known, only being specied up to a set of unknown parameters. We propose heuristic policies for dierent cases, and evaluate their performances using simulation. In Chapter 8, we consider a variation of the problem where the player is not allowed to play abandoned arms. We show that when N follows a general distribution the optimal policy is a threshold policy and the optimal threshold values c(n;k) increase in k. We also show that when N is IFR, the optimal threshold values c(n;k) decrease in n, implying that the optimal policy is a threshold stopping policy. In Chapter 9, we summarize the results obtained, and propose directions for future research. 
16 Chapter 2 General properties of optimal policy In this chapter, we consider the case whereN follows a known general distribution, and discuss the general properties of the optimal policy. This chapter is organized as follows. In Section 2.1, we show that when E[N 2 ]<1, V (n;x;k) is nite for any state (n;x;k). In Section 2.2, we show that the optimal policy always has a threshold structure, and the optimal threshold values c(n;k) increase in k, the number of remaining new arms. 2.1 Finiteness of the expected sum of rewards We rst show that if E[N 2 ] <1 then for any state (n;x;k), V (n;x;k) <1. In doing so, we assume that the player keeps on playing forever, while only the rewards earned up to gameN count towards the objective. Also, let N n = NnjN n denote the additional horizon after game n given that the horizon is at least n games. Lemma 1. For any policy and any state (n;x;k),V (n;x;k)x(1+E[N n ])+E[(N n +1)(N n + 2)]=2. 17 Proof. Let be an arbitrary policy and (n;x;k) be an arbitrary state. Consider a process that starts from state (n;x;k) and employs policy . Let A(i) denote the reward earned in game i of the process, i = n;n + 1;:::. Let X j denote the value of the j th played new arm in the process, j = 1; 2;:::. ThenX 1 ;X 2 ;::: are independently and identically distributed (i.i.d.) random variables following distribution F . Since N n is independent of X 1 ;X 2 ;:::, the expected sum of rewards under is V (n;x;k) =E " n+Nn X i=n A(i) # E " n+Nn X i=n maxfx;X 1 ;:::;X in+1 g # E " n+Nn X i=n (x +X 1 +::: +X in+1 ) # E 2 4 x(1 +N n ) + Nn+1 X j=1 X j (N n j + 2) 3 5 =x (1 +E[N n ]) + 1 X t=0 E 2 4 t+1 X j=1 X j (tj + 2) N n =t 3 5 P (N n =t) =x (1 +E[N n ]) + 1 X t=0 E 2 4 t+1 X j=1 X j (tj + 2) 3 5 P (N n =t) =x (1 +E[N n ]) + 1 X t=0 (t + 1)(t + 2) 2 P (N n =t) =x (1 +E[N n ]) + E[(N n + 1)(N n + 2)] 2 Proposition 2. If E[N 2 ]<1, then for any state (n;x;k), V (n;x;k)<1. Proof. Since E[N 2 ] =E[(n +N n ) 2 ]P (Nn) +E[N 2 jN <n]P (N <n) n 2 + 2nE[N n ] +E[(N n ) 2 ] P (N >n) 18 it follows that E[N 2 ]<1 implies that E[(N n ) 2 ]<1 for any n 1. The result then follows as for xed state (n;x;k),V (n;x;k) is uniformly bounded byx(1+E[N n ])+ E[(N n + 1)(N n + 2)]=2 for any policy . Therefore, throughout this paper, we always assume E[N 2 ]<1. 2.2 General structure of optimal policy Now we show that the optimal policy always has a threshold structure. Proposition 3. If it is optimal to play x in state (n;x;k), it is also optimal to play x 0 in state (n;x 0 ;k) for any x 0 x. Proof. It suces to show that for any x 0 x, V old (n;x 0 ;k)V new (n;x 0 ;k)V old (n;x;k)V new (n;x;k) or equivalently V old (n;x 0 ;k) +V new (n;x;k)V new (n;x 0 ;k) +V old (n;x;k) (2.1) Fixn,k,x andx 0 , and suppose thatx 0 x. We consider four processes: the rst starts from state (n;x 0 ;k) and plays the best old arm in game n, the second starts from state (n;x 0 ;k) and plays a new arm in gamen, the third starts from state (n;x;k) and plays the best old arm in gamen, and the fourth starts from state (n;x;k) and plays a new arm in game n. Assume that the player keeps on playing forever, while only the rewards earned up to game N count towards the objective. Let V i (s;t) denote the expected sum of rewards of process i from game s to game t, i = 1; 2; 3; 4; 1st. 
We will show that no matter what policies processes 2 19 and 3 follow, we can always construct policies for processes 1 and 4 such that V 1 (n;t) +V 4 (n;t)V 2 (n;t) +V 3 (n;t) for any tn which implies that E[V 1 (n;N)] +E[V 4 (n;N)]E[V 2 (n;N)] +E[V 3 (n;N)] Consequently, by letting process 2 follow the optimal policy after the initial play of a new arm, and process 3 follow the optimal policy after the inital play of the best old arm, the preceding implies that E[V 1 (n;N)] +E[V 4 (n;N)]V new (n;x 0 ;k) +V old (n;x;k) which implies equation (2.1) and thus yields the result. Letting X j ;j = 1; 2;:::;k be i.i.d random variables following distribution F , we couple the four processes such that in each process the j th played new arm has value X j , j = 1; 2;:::;k. The initial states of the four processes are summarized in the following table. Process Initial state Reward in game n State after game n 1 (n;x 0 ;k) x 0 (n + 1;x 0 ;k) 2 (n;x 0 ;k) X 1 (n + 1; maxfx 0 ;X 1 g;k 1) 3 (n;x;k) x (n + 1;x;k) 4 (n;x;k) X 1 (n + 1; maxfx;X 1 g;k 1) Table 2.1: The initial states of the four processes Starting from game n + 1, let processes 2 and 3 follow arbitrary policies. Let D i (t) denote the number of new arms played in process i from game n to game t, and E i (t) denote the number of times the player plays old arms in process i from gamen to gamet, thenD i (t) +E i (t) =tn + 1, i = 1; 2; 3; 4; tn. 20 Let T = minft :tn;D 2 (t) =D 3 (t)g be the rst game after which processes 2 and 3 have played the same number of new arms (T could be innity). We now construct the policies for processes 1 and 4. Let process 1 make the same decisions as process 3 from game n + 1 to game T , and make the same decisions as process 2 from gameT + 1 onward. Let process 4 make the same decisions as process 2 from game n + 1 to game T , and make the same decisions as process 3 from game T + 1 onward. We show that under such polices V 1 (n;t) +V 4 (n;t)V 2 (n;t) +V 3 (n;t) for any tn. First note that D 4 (t) =D 2 (t)>D 3 (t) =D 1 (t) for nt<T , D 1 (T ) =D 2 (T ) =D 3 (T ) =D 4 (T ), E 4 (t) = E 2 (t) < E 3 (t) = E 1 (t) for n t < T , and E 1 (T ) = E 2 (T ) = E 3 (T ) = E 4 (T ). Let D = D 1 (T ) and E = E 1 (T ). Let I i (j) denote the game of the j th play of new arms in process i, andJ i (j) denote the game of thej th play of old arms in processi, thenI 4 (j) =I 2 (j)<I 3 (j) =I 1 (j) for j = 1; 2;:::;D, and J 4 (j) = J 2 (j) > J 3 (j) = J 1 (j) for j = 1; 2;:::;E. The states of the four processes are summarized in the following table. Process Initial state State after game n State after game T 1 (n;x 0 ;k) (n + 1;x 0 ;k) (T + 1; maxfx 0 ;X 1 ;:::;X D g;kD) 2 (n;x 0 ;k) (n + 1; maxfx 0 ;X 1 g;k 1) (T + 1; maxfx 0 ;X 1 ;:::;X D g;kD) 3 (n;x;k) (n + 1;x;k) (T + 1; maxfx;X 1 ;:::;X D g;kD) 4 (n;x;k) (n + 1; maxfx;X 1 g;k 1) (T + 1; maxfx;X 1 ;:::;X D g;kD) Table 2.2: The states of the four processes Let S i (t) denote the best old value when game t is about to be played in process i, i = 1; 2; 3; 4, then S i (n) = 8 > > > < > > > : x 0 ; for i = 1; 2 x; for i = 3; 4 21 and S i (t) = maxfS i (n);X 1 ;X 2 ;:::;X D i (t1) g for t>n: Let A i (t) denote the value of the arm played in game t of process i, i = 1; 2; 3; 4; tn. 
Then A i (t) = 8 > > > < > > > : S i (t); if the best old arm is played in game t X D i (t1)+1 ; if a new arm is played in game t For any t =n;n + 1;:::;T , V i (n;t) = t X j=n A i (j) = D i (t) X j=1 X j + E i (t) X j=1 S i (J i (j)) Because game J i (j) is the j th game that process i plays an old arm, it follows that from game n to gameJ i (j) 1, processi playsj 1 times of old arms and J i (j)nj + 1 times of new arms. Thus, S i (J i (j)) = maxfS i (n);X 1 ;X 2 ;:::;X J i (j)nj+1 g Now, for j = 1; 2;:::;E, since J 1 (j) =J 3 (j)<J 2 (j) =J 4 (j), it follows that S 1 (J 1 (j))S 3 (J 3 (j)) = maxfx 0 ;X 1 ;:::;X J 3 (j)nj+1 g maxfx;X 1 ;:::;X J 3 (j)nj+1 g maxfx 0 ;X 1 ;:::;X J 3 (j)nj+1 ;:::;X J 2 (j)nj+1 g maxfx;X 1 ;:::;X J 3 (j)nj+1 ;:::;X J 2 (j)nj+1 g =S 2 (J 2 (j))S 4 (J 4 (j)) (2.2) 22 Consequently, for any t =n;n + 1;:::;T , V 1 (n;t)V 3 (n;t) = D 1 (t) X j=1 X j + E 1 (t) X j=1 S 1 (J 1 (j)) D 3 (t) X j=1 X j E 3 (t) X j=1 S 3 (J 3 (j)) = E 1 (t) X j=1 S 1 (J 1 (j)) E 3 (t) X j=1 S 3 (J 3 (j)) = E 3 (t) X j=1 S 1 (J 1 (j))S 3 (J 3 (j)) = E 2 (t) X j=1 S 1 (J 1 (j))S 3 (J 3 (j)) + E 3 (t) X j=E 2 (t)+1 S 1 (J 1 (j))S 3 (J 3 (j)) E 2 (t) X j=1 S 1 (J 1 (j))S 3 (J 3 (j)) E 2 (t) X j=1 S 2 (J 2 (j))S 4 (J 4 (j)) = D 2 (t) X j=1 X j + E 2 (t) X j=1 S 2 (J 2 (j)) D 4 (t) X j=1 X j E 4 (t) X j=1 S 4 (J 4 (j)) =V 2 (n;t)V 4 (n;t) where the rst inequality follows from the fact that S 1 (J 1 (j)) = maxfx 0 ;X 1 ;:::;X J 1 (j)nj+1 g = maxfx 0 ;X 1 ;:::;X J 3 (j)nj+1 g maxfx;X 1 ;:::;X J 3 (j)nj+1 g = S 3 (J 3 (j)) for j = 1; 2;:::;E, and the second inequality follows from Equation (2.2). Consequently, for any t = n;n + 1;:::;T , we have V 1 (n;t) +V 4 (n;t)V 2 (n;t) +V 3 (n;t). Now, it remains to show thatV 1 (n;t)+V 4 (n;t)V 2 (n;t)+V 3 (n;t) for anytT +1. Since process 1 is in the same state as process 2 after game T (see Table 2.2) and will make the same decisions as process 2 from game T + 1 onward, it follows that V 1 (T + 1;t) =V 2 (T + 1;t) for any tT + 1. Similarly, we can show that V 4 (T + 1;t) = V 3 (T + 1;t) for any t T + 1, which together with 23 V 1 (n;T ) +V 4 (n;T )V 2 (n;T ) +V 3 (n;T ) yields that V 1 (n;t) +V 4 (n;t) =V 1 (n;T ) +V 4 (n;T ) +V 1 (T + 1;t) +V 4 (T + 1;t) V 2 (n;T ) +V 3 (n;T ) +V 2 (T + 1;t) +V 3 (T + 1;t) =V 2 (n;t) +V 3 (n;t) for any tT + 1. The proof is now complete. The proposition immediately implies that the optimal policy has a threshold structure. Corollary 4. There exist threshold values c(n;k);n 1; 0 k K such that the optimal action in state (n;x;k) is to play x if xc(n;k), and to play a new arm if x<c(n;k). The threshold values c(n;k) are referred to as the optimal threshold values. The following propo- sition shows that the optimal threshold values c(n;k) increase in k, the number of remaining new arms. Proposition 5. For xed n, the optimal threshold values c(n;k) increase in k. Proof. It suces to show that for any k 1, V new (n;x;k + 1)V old (n;x;k + 1)V new (n;x;k)V old (n;x;k) or equivalently, V new (n;x;k + 1) +V old (n;x;k)V old (n;x;k + 1) +V new (n;x;k) (2.3) Fixn,x andk. We consider four processes: the rst starts from state (n;x;k + 1) and plays a new arm in gamen, the second starts from state (n;x;k + 1) and plays the best old arm in gamen, the 24 third starts from state (n;x;k) and plays a new arm in game n, and the fourth starts from state (n;x;k) and plays the best old arm in game n. Assume that the player keeps on playing forever, while only the rewards earned up to gameN count towards the objective. 
Let V i (s;t) denote the expected sum of rewards of process i from games to gamet,i = 1; 2; 3; 4; 1st. We show that no matter what policies processes 2 and 3 follow, we can construct policies for processes 1 and 4 such that V 1 (n;t) +V 4 (n;t) =V 2 (n;t) +V 3 (n;t) for any tn which implies that E[V 1 (n;N)] +E[V 4 (n;N)] =E[V 2 (n;N)] +E[V 3 (n;N)] Consequently, by letting process 2 follow the optimal policy after the initial play of the best old arm, and process 3 follow the optimal policy after the initial play of a new arm, the preceding implies that E[V 1 (n;N)] +E[V 4 (n;N)] =V old (n;x;k + 1) +V new (n;x;k) which implies equation (2.3) and thus yields the result. LettingX 1 ;X 2 ;::: be i.i.d random variables following distribution F , we couple the four processes such that in each process the j th played new arm has value X j , j = 1; 2;:::. The initial states of the four processes are summarized in the following table. Process Initial state Reward in game n State after game n 1 (n;x;k + 1) X 1 (n + 1; maxfx;X 1 g;k) 2 (n;x;k + 1) x (n + 1;x;k + 1) 3 (n;x;k) X 1 (n + 1; maxfx;X 1 g;k 1) 4 (n;x;k) x (n + 1;x;k) Table 2.3: The initial states of the four processes 25 Starting from game n + 1, let processes 2 and 3 follow arbitrary policies. Let D i (t) denote the number of new arms played in process i from game n to game t, i = 1; 2; 3; 4; t n. Let T = minft :tn;D 2 (t) =D 3 (t)g be the rst game after which processes 2 and 3 have played the same number of new arms (T could be innity). We now construct the policies for processes 1 and 4. Let process 1 make the same decisions as process 3 from gamen+1 to gameT , and make the same decisions as process 2 from gameT + 1 onward. Let process 4 make the same decisions as process 2 from gamen+1 to gameT , and make the same decisions as process 3 from gameT +1 onward. Then D 4 (t) =D 2 (t)<D 3 (t) =D 1 (t)k forntT 1, andD 1 (T ) =D 2 (T ) =D 3 (T ) =D 4 (T )k. Let D =D 1 (T ). Note that process 1 is always able to match the decisions of process 3 from game n + 1 to game T , because process 1 starts with one new arm more than process 3. Process 4 is also able to match the decisions of process 2 from game n + 1 to game T , because D 4 (t) < k for t =n;n + 1;:::;T 1. After gameT , processes 1 and 2 are in the same state, and processes 3 and 4 are in the same state. Thus, from gameT + 1 onward, process 1 is able to match the decisions of process 2, and process 4 is able to match the decisions of process 3. The states of the four processes are summarized in the following table. Process Initial state State after game n State after game T 1 (n;x;k + 1) (n + 1; maxfx;X 1 g;k) (T + 1; maxfx;X 1 ;:::;X D g;k + 1D) 2 (n;x;k + 1) (n + 1;x;k + 1) (T + 1; maxfx;X 1 ;:::;X D g;k + 1D) 3 (n;x;k) (n + 1; maxfx;X 1 g;k 1) (T + 1; maxfx;X 1 ;:::;X D g;kD) 4 (n;x;k) (n + 1;x;k) (T + 1; maxfx;X 1 ;:::;X D g;kD) Table 2.4: The states of the four processes Since process 1 matches process 3 and process 4 matches process 2 from game n to game T , we have that for any t = n;n + 1;:::;T , V 1 (n;t) = V 3 (n;t) and V 4 (n;t) = V 2 (n;t). After game T , processes 1 and 2 are in the same state, and processes 3 and 4 are in the same state. Since process 1 matches process 2 and process 4 matches process 3 from game T + 1 onward, we have that for 26 any tT + 1, V 1 (T + 1;t) =V 2 (T + 1;t) and V 4 (T + 1;t) =V 3 (T + 1;t). Consequently, under the preceding policies, we have V 1 (n;t) +V 4 (n;t) =V 2 (n;t) +V 3 (n;t) for any tn The proof is now complete. 
We now show that the optimal threshold values are greater than or equal to the mean of the reward distribution if there is at least one remaining new arm. Proposition 6. If k 1 then c(n;k). Proof. Suppose that k 1, and consider a process that starts from state (n;x;k) where x is an arbitrary constant such that 0x<. We show that the optimal policy should play a new arm in game n. First, we note that the optimal policy should play a new arm at certain point of the process, because otherwise the expected sum of rewards is x +xE[N n ], which is less than the expected sum of rewards under a policy that plays a new arm in game n and then keeps playing that new arm. Let T denote the number of games policy plays the best old arm x before playing the rst new arm. If T > 0, then with X denoting a random variable following distribution F , we have V (n;x;k) =x +xE[minfT 1;N n g] +P (N n T ) +E[V (n +T + 1; maxfx;Xg;k 1)]P (N n T + 1) However, letting 0 denote a policy that plays a new arm in game n, plays the best old arm from gamen + 1 to gamen +T , and makes the same decisions as policy from gamen +T + 1 onward, 27 the expected sum of rewards of the process under policy 0 is V 0(n;x;k) = +E[maxfx;Xg]E[minfT 1;N n g] +E[maxfx;Xg]P (N n T ) +E[V (n +T + 1; maxfx;Xg;k 1)]P (N n T + 1) >V (n;x;k) which contradicts to the optimality of V (n;x;k). Therefore, we must haveT = 0, implying that the optimal action in state (n;x;k) is to play a new arm in game n. Remark: Proposition 5 shows that the optimal threshold values c(n;k) increase in k. In general, c(n;k) depends on k, even for k 1 (an example of this is given in the appendix). A special case where c(n;k) does not depend onk fork 1, is whenN is IFR. We will further study this case in Chapter 3. 28 Chapter 3 When N has increasing failure rate In this chapter, we consider the case where the horizon N follows a known increasing failure rate (IFR) distribution; that is, N is such that P (N =njNn) increases in n. The IFR horizon model has many practical applications. For example, when the horizon is xed, or when the horizon is geometric, which is equivalent to an innite horizon problem but with discounted rewards. This chapter is organized as follows. In Section 3.1 we give the optimal policy when N is IFR. The optimal policy is a threshold stopping policy, and the optimal threshold values are given by a one-stage lookahead procedure. In Section 3.2, we derive an expression for the maximum expected sum of rewards, and in Section 3.3, we discuss how simulation can be eciently used to estimate that quantity. 29 3.1 Optimal policy: the one-stage lookahead policy In Chapter 2, we showed that whenN follows a general distribution, the optimal policy always has a threshold structure. That is, with the state denoted by (n;x;k) where n is the game about to be played, x is the best old value so far and k is the number of remaining new arms, there exist threshold values c(n;k) such that the optimal action in state (n;x;k) is to play x if x c(n;k), and to play a new arm otherwise. Recall that a threshold stopping policy is a threshold policy whose threshold valuesc(n;k) decrease in n. In other words, under a threshold stopping policy, once the player plays an old arm, the player should keep playing that arm throughout the remaining games. The term stop refers to \stop exploring new arms and keep playing the best old arm throughout". We rst show that when N is IFR, the optimal policy is a threshold stopping policy. Before doing so, we need the following lemma. Lemma 7. 
Suppose that a process starts from an arbitrary state (n 0 ;x 0 ;k 0 ) and employs an arbi- trary threshold stopping policyfc(i;j);in 0 ; 0jk 0 g. Let A(n) denote the reward earned in game n of this process, nn 0 . Then E[A(n)] increases in n. Proof. It is equivalent to show that E[A(n + 1)]E[A(n)] for any nn 0 . Fixn, and let (n;x;k) denote the state of the process when gamen is about to be played. Consider two cases of the state. Case 1: xc(n;k). In this case, arm x is played from game n onward, so we have E[A(n + 1)] = E[A(n)] =x. 30 Case 2: x<c(n;k). In this case, a new arm X is played in game n, and A(n + 1) = 8 > > > < > > > : maxfx;Xg; if maxfx;Xgc(n + 1;k 1) Y; otherwise where Y denotes the value of another new arm. Thus, in this case E[A(n + 1)] =E[maxfx;Xgj maxfx;Xgc(n + 1;k 1)]P (maxfx;Xgc(n + 1;k 1)) +P (maxfx;Xg<c(n + 1;k 1)) =E[A(n)] Combining the two cases yields the result. Proposition 8. When N is IFR, if it is optimal to play x in state (n;x;k), it is also optimal to play x in state (n + 1;x;k). That is, the optimal policy is a threshold stopping policy. Proof. We prove this by induction on k. Recall that we assume that the player keeps on playing forever, while only the rewards earned up to game N count towards the objective. When k = 0, the result is immediate as there is no remaining new arm. Now, suppose that we have proven the result for up to k 1 new arms, and consider the case of k new arms. Fix n, x and k, and suppose that it is optimal to play x in state (n;x;k). We prove that it is also optimal to play x in state (n + 1;x;k) by contradiction. Suppose that V new (n + 1;x;k)>V old (n + 1;x;k), then the optimal policy plays a new armX in gamen+1. After that game, if the horizon has not ended yet, policy will be in state (n+2; maxfx;Xg;k1). By induction hypothesis, from then on policy should follow a threshold stopping policy. Letfc (i;j);in + 2; 0jk 1g 31 denote that optimal threshold stopping policy, and A(i) denote the reward earned in gamei under that policy, i = n + 2;n + 3;:::. With N n = NnjN n denoting the additional horizon after game n given that the horizon is at least n games, we have V (n + 1;x;k) =E[X + n+1+N n+1 X i=n+2 A(i)] = + 1 X i=n+2 E[A(i)]P (N n+1 in 1) and V (n;x;k) =x +P (N n 1)V (n + 1;x;k) =x +P (N n 1) + 1 X i=n+2 E[A(i)]P (N n+1 in 1) ! =x +P (N n 1) + 1 X i=n+2 E[A(i)]P (N n+1 in 1)P (N n 1) =x +P (N n 1) + 1 X i=n+2 E[A(i)]P (N n in) where the last equality follows from the fact that for in + 2, P (N n+1 in 1)P (N n 1) =P (NijNn + 1)P (Nn + 1jNn) =P (NijNn) =P (N n in) We now construct two policies 1 and 2 , and show that V 1 (n;x;k)V (n;x;k)V 2 (n + 1;x;k)V (n + 1;x;k) Let policy 1 play a new arm in state (n;x;k). Without loss of generality, we assume that this new arm also has value X. Consequently, after game n, policy 1 will have k 1 remaining new arms and a best old value of maxfx;Xg, which are the same quantities as policy has after gamen + 1 (see Table 3.1). 32 Policy State before game n State after game n State after game n + 1 (n;x;k) (n + 1;x;k) (n + 2; maxfx;Xg;k 1) 1 (n;x;k) (n + 1; maxfx;Xg;k 1) Table 3.1: The states of policies and 1 From gamen + 1 onward, let 1 follow a threshold stopping policy with threshold values c 1 (i;j) = c (i + 1;j),i =n + 1;n + 2;::: ;j = 0; 1;:::;k 1. Then the reward earned in gamei under policy 1 has the same distribution as A(i + 1), i =n + 1;n + 2;:::. 
Consequently, under policy 1 , the expected sum of rewards from state (n;x;k) onward is V 1 (n;x;k) = +E[ n+Nn X i=n+1 A(i + 1)] = + 1 X i=n+1 E[A(i + 1)]P (N n in) On the other hand, let policy 2 play armx in state (n + 1;x;k), and play a new arm in state (n + 2;x;k). Without loss of generality, we assume that this new arm also has value X. Consequently, after gamen + 2, policy 2 will havek 1 remaining new arms and a best old value of maxfx;Xg, which are the same quantities as policy has after game n + 1 (see Table 3.2). Policy State before game n + 1 State after game n + 1 State after game n + 2 (n + 1;x;k) (n + 2; maxfx;Xg;k 1) 1 (n + 1;x;k) (n + 2;x;k) (n + 3; maxfx;Xg;k 1) Table 3.2: The states of policies and 2 From gamen + 3 onward, let 2 follow a threshold stopping policy with threshold values c 2 (i;j) = c (i 1;j),i =n + 3;n + 4;::: ;j = 0; 1;:::;k 1. Then the reward earned in gamei under policy 2 has the same distribution as A(i 1), i =n + 3;n + 4;:::. 33 Consequently, under policy 2 , the expected sum of rewards from state (n + 1;x;k) onward is V 2 (n + 1;x;k) =x +P (N n+1 1) +E[ n+1+N n+1 X i=n+3 A(i 1)] =x +P (N n+1 1) + 1 X i=n+3 E[A(i 1)]P (N n+1 in 1) Now, V 1 (n;x;k)V (n;x;k) = + 1 X i=n+1 E[A(i + 1)]P (N n in)xP (N n 1) 1 X i=n+2 E[A(i)]P (N n in) =x + (E[A(n + 2)])P (N n 1) + 1 X i=n+2 (E[A(i + 1)]E[A(i)])P (N n in) x + (E[A(n + 2)])P (N n+1 1) + 1 X i=n+2 (E[A(i + 1)]E[A(i)])P (N n+1 in) =x + (E[A(n + 2)])P (N n+1 1) + 1 X j=n+3 (E[A(j)]E[A(j 1)])P (N n+1 jn 1) = + 1 X i=n+2 E[A(i)]P (N n+1 in 1) xP (N n+1 1) 1 X i=n+3 E[A(i 1)]P (N n+1 in 1) =V (n + 1;x;k)V 2 (n + 1;x;k) where the inequality follows from the fact thatN is IFR and thatE[A(n+2)]E[A(n+3)] , which has been shown in Lemma 7. However, since policy 2 plays arm x in state (n + 1;x;k) and we have supposed that V new (n + 1;x;k)>V old (n + 1;x;k), it follows that V 1 (n;x;k)V (n;x;k)V (n + 1;x;k)V 2 (n + 1;x;k) V (n + 1;x;k)V old (n + 1;x;k) =V new (n + 1;x;k)V old (n + 1;x;k) > 0 34 which contradicts to the optimality of V (n;x;k). Therefore, we must have V new (n + 1;x;k) V old (n + 1;x;k), which implies that it is optimal to play x in state (n + 1;x;k). The result then follows by induction. Remark: When the horizonN is deterministic, Proposition 8 can also be shown by an interchange argument: if the player plays an old arm x and then plays a new arm X in the next game, it would be better to rst play X and then play maxfx;Xg. However, we could not use this interchange argument when N is not deterministic, because the horizon may end after the initial choice. Proposition 8 shows that when N is IFR, the optimal policy is a threshold stopping policy. This enables us to model the problem as an optimal stopping problem, where \stop" means to \stop exploring new arms and keep playing the best old arm". Using optimal stopping theory, we can obtain the optimal policy using the following one-stage lookahead procedure. Let B =f(n;x;k) :k = 0g[f(n;x;k) :k 1;x +xE[N n ] +E[maxfx;Xg]E[N n ]g denote the set of states in which there is no remaining new arms or there is at least one new arm but stopping is at least as good as continuing for exactly one more period and then stopping. If we can show that B is a closed set of states, which means that once we enter set B we cannot make transition to states outside setB, then by optimal stopping theory, it is optimal to stop if the state (n;x;k)2B, and to play a new arm if (n;x;k) = 2B. This policy is called the one-stage lookahead policy. Lemma 9. 
When N is IFR, B is a closed set of states. 35 Proof. For any state (n;x;k), the set of states to which it can make transition is f(n + 1;x;k)g[f(n + 1;x 0 ;k 1) :x 0 xg Thus, to show that B =f(n;x;k) :k = 0g[f(n;x;k) :k 1;x +xE[N n ] +E[maxfx;Xg]E[N n ]g is a closed set of states, it suces to show that the function f(n;x;k) =x +xE[N n ]E[maxfx;Xg]E[N n ] =xE[(Xx) + ]E[N n ] is increasing in n and x and decreasing in k, which are easy to verify when N is IFR. Now, as we have shown that B is a closed set of states, it follows from optimal stopping theory that the one-stage lookahead policy is optimal. Proposition 10. When N is IFR, the optimal action in state (n;x;k) is to play x throughout the remaining games if (n;x;k)2B, and to play a new arm in game n if (n;x;k) = 2B. Note that the set B =f(n;x;k) :k = 0g[f(n;x;k) :k 1;x +xE[N n ] +E[maxfx;Xg]E[N n ]g =f(n;x;k) :k = 0g[f(n;x;k) :k 1;xE[(Xx) + ]E[N n ] 0g =f(n;x;k) :xc(n;k)g where c(n;k) = 8 > > > < > > > : 0; if k = 0 inffx :xE[(Xx) + ]E[N n ] 0g; if k 1 (3.1) The optimal policy is a threshold stopping policy with threshold values c(n;k). 36 In practice, the value of c(n;k) can be numerically obtained using the binary search algorithm, as function f(x) =xE[(Xx) + ]E[N n ] is strictly increasing in x. Remark: We conjecture that whenN has decreasing failure rate (DFR), the optimal threshold valuesc(n;k) increase in n. 3.2 Expression for the maximum expected sum of rewards In this section, we derive an expression for the maximum expected sum of rewards V under the one-stage lookahead policy. Let S denote the optimal stopping time, which is the rst game that the optimal policy plays an old arm. Note that we always have SK + 1, because there are K arms in total. Let X j denote the value of the j th played new arm under the optimal policy, j = 1; 2;:::. Then S = minfn :n 2; maxfX 1 ;X 2 ;:::;X n1 gc(n;Kn + 1)g wherec(n;Kn+1) are given in Equation (3.1). For notational convenience, letc n =c(n;Kn+1), n = 1; 2;:::;K + 1. Under the optimal policy, the player plays new arms up to game S 1, and play the best old arm from game S onward. Letting A(n) denote the reward earned in game n, n = 1; 2;:::, we have A(n) = 8 > > > < > > > : X n ; n<S maxfX 1 ;:::;X S1 g; nS 37 The distribution of S can be obtained as follows. Since fS >ig()fmaxfX 1 ;:::;X i1 g<c i g we have P (S >i) =P (maxfX 1 ;:::;X i1 g<c i ) =F (c i ) i1 Consequently, P (S =i) =P (S >i 1)P (S >i) =F (c i1 ) i2 F (c i ) i1 (3.2) for i = 2; 3;:::;K + 1 To obtain the expression for the maximum expected sum of rewards, we need the following lemma. Lemma 11. For 2iK + 1 P (S =i;A(j)<t) = 8 > > > > > > > < > > > > > > > : F (c i1 ) i3 F (minft;c i1 g)F (c i ) i2 F (minft;c i g); 1ji 2 F (c i1 ) i2 F (t)F (c i ) i2 F (minft;c i g); j =i 1 F (minft;c i1 g) i2 F (t)F (minft;c i g) i1 ; ji The proof of Lemma 11 is in the appendix. We now give the expression for the maximum expected sum of rewards. Proposition 12. When N is IFR, the maximum expected sum of rewards is V = 1 X n=1 n X j=1 Z 1 0 1 K+1 X i=2 P (S =i;A(j)<t) ! dtP (N =n) where P (S =i;A(j)<t) is given in Lemma 11. 38 Proof. Under the optimal policy, V =E[ N X j=1 A(j)] = 1 X n=1 E[ n X j=1 A(j)jN =n]P (N =n) = 1 X n=1 E[ n X j=1 A(j)]P (N =n) where the last equality follows from the fact that N is independent to A(j);j = 1; 2;:::. Now, for j = 1; 2;:::, E[A(j)] = Z 1 0 (1P (A(j)<t))dt = Z 1 0 1 K+1 X i=2 P (S =i;A(j)<t) ! 
dt: Therefore, V = 1 X n=1 E[ n X j=1 A(j)]P (N =n) = 1 X n=1 n X j=1 E[A(j)]P (N =n) = 1 X n=1 n X j=1 Z 1 0 1 K+1 X i=2 P (S =i;A(j)<t) ! dtP (N =n): Remark: When N is geometric, the optimal threshold values c n are a constant (except when there is no remaining new arm). A simpler expression for V can be obtained in this case (see Section 5.1.3). 39 3.3 Ecient use of simulation for estimating V In this section, we discuss how simulation can be eciently used to estimate V . We rst give the raw estimator, and then use control variable and post-stratication techniques to reduce the variance of the estimator. 3.3.1 The raw estimator The idea of the raw estimator is to repeatedly simulate samples of the sum of rewards under the optimal policy, and use the mean of the samples as an estimate of V . Algorithm The raw estimator of V Input: M = number of repetitions 1: for i = 1; 2;:::;M do 2: Simulate a sample V i of the sum of rewards under the optimal policy Output: Use V = P M i=1 V i =M as an estimate of V 3.3.2 Variance reduction using a control variable Control variable (see Ross [68]) is a variance reduction technique commonly used in simulation. Suppose that we are interested in estimating =E[X] using simulation. The raw estimator is to repeatedly generate samples of X, and use the mean of the samples to estimate . However, when simulating a sample of X, if we can also obtain a sample of another random variable Y , whose expectation Y =E[Y ] is known, then for any constant c, the random variable X +c(Y Y ) has expectation , and thus is an unbiased estimator of . 40 The best value of c that minimizes the variance of the estimator is c = Cov(X;Y ) Var(Y ) with the variance of the estimator attains its minimum min c Var(X +c(Y Y )) =Var(X +c (Y Y )) =Var(X)(1(X;Y )) where (X;Y ) is the correlation between X and Y . Although (X;Y ) is usually unknown, in practice we can use the simulation data to approximate c . Letting X i ;Y i ;i = 1; 2;:::;M be samples of random variables X and Y , X = P M i=1 X i =M and Y = P M i=1 Y i =M, we can use ~ c = P M i=1 (X i X)(Y i Y ) P M i=1 (Y i Y ) 2 as the constant of the control variable estimator and obtain a good variance reduction performance. For estimating the maximum expected sum of rewards V , the stopping time S, which is the rst game that the optimal policy plays an old arm, seems to be a good choice of control variable, because it is highly correlated to the sum of rewards. When S is small, it seems that the player has seen a very good arm at the early stage of the process, which tends to result in a large value of the sum of rewards. When S is large, it seems that the player has not seen any good arms until the late stage of the process, which tends to result in a small value of the sum of rewards. Thus, S seems to be a good control variable for estimating V . In Section 3.2, we have derived the distribution of S as P (S =n) =F (c n1 ) n2 F (c n ) n1 ; n = 2; 3;:::;K + 1 where c(n) = 8 > > > < > > > : inffx :xE[(Xx) + ]E[N n ] 0g; n = 1; 2;:::;K 0; n =K + 1;K + 2;::: 41 The expectation of S can be obtained by S =E[S] = K+1 X n=2 nP (S =n) (3.3) The control variable estimator is shown as follows. 
Algorithm The control variable estimator of V Input: M = number of repetitions 1: Compute S using Equation (3.3) 2: for i = 1; 2;:::;M do 3: Simulate a sampleV i of the sum of rewards and output the stopping timeS i in this simulation 4: Let V = P M i=1 V i =M and S = P M i=1 S i =M 5: Let ~ c = P M i=1 (V i V )(S i S)= P M i=1 (S i S) 2 Output: Use ~ V = P M i=1 (V i + ~ c(S i S ))=M as an estimate of V 3.3.3 Variance reduction using post-stratication Post-stratication (see Ross [68]) is another commonly used variance reduction technique. Suppose that we are interested in estimating = E[X] using simulation. The raw estimator simulates samplesX 1 ;X 2 ;:::;X M of random variableX, and uses their mean X = P M i=1 X i =M as an estimate of. Suppose that when simulating a sampleX i , we can also obtain a sampleY i of another random variableY , whose distribution is known to us. Lettingn(y) denote the number of samplesY i having value y, the random variable X y P i:Y i =y X i n(y) P (Y =y) has expectation , and thus is an unbiased estimator of . To use the post-stratication technique for estimating V , we stratify on the stopping timeS. The post-stratication estimator is shown as follows. 42 Algorithm The post-stratication estimator of V Input: M = number of repetitions 1: for i = 1; 2;:::;M do 2: Simulate a sampleV i of the sum of rewards and output the stopping timeS i in this simulation 3: Let n(s) denote the number of simulation runs with stopping time S i =s, s = 2; 3;:::;K + 1 Output: Use ^ V = P K+1 s=2 P i:S i =s V i =n(s) P (S =s) as an estimate of V 3.3.4 Performance evaluation of the estimators In this section, we evaluate the variance reduction performances of the control variable estimator and the post-stratication estimator, comparing to the raw estimator. We suppose that the horizon is deterministic and there are suciently many new arms, and consider two cases of the reward distribution: Uniform(0; 1) and Exponential(1). We performM = 100; 000 repetition runs, and compute the value and variance of the nal estimators. The results are shown in the following two tables. Raw estimator Post-stratication estimator Control variable estimator N Value Variance Value Variance Value Variance 10 7.38 2.18e-5 7.39 5.21e-6 7.39 5.20e-6 20 16.07 4.99e-5 16.07 1.22e-5 16.07 1.21e-5 50 43.46 1.40e-4 43.46 3.44e-5 43.46 3.44e-5 100 90.53 2.96e-4 90.52 7.32e-5 90.51 7.33e-5 200 186.39 6.14e-4 186.38 1.54e-4 186.37 1.53e-4 500 478.18 1.58e-3 478.12 4.01e-4 478.16 3.96e-4 1000 968.87 3.22e-3 968.90 8.18e-4 968.87 8.00e-4 Table 3.3: The value and variance of the estimators when F is Uniform(0; 1) 43 Raw estimator Post-stratication estimator Control variable estimator N Value Variance Value Variance Value Variance 10 19.11 1.01e-3 19.15 4.72e-4 19.12 4.72e-4 20 46.58 4.33e-3 46.63 2.02e-3 46.64 2.01e-3 50 147.58 2.95e-2 147.47 1.38e-2 147.34 1.36e-2 100 345.62 1.26e-1 345.13 5.85e-2 345.42 5.88e-2 200 797.51 5.32e-1 796.70 2.51e-1 797.15 2.50e-1 500 2356.68 3.48 2355.85 1.66 2355.89 1.65 1000 5283.36 14.40 5285.67 7.05 5284.10 6.92 Table 3.4: The value and variance of the estimators when F is Exponential(1) We observe that the control variable estimator and the post-stratication estimator have similar performances and both signicantly reduce the variance comparing to the raw estimator: the variance is reduced by around 75% in the uniform reward case and by around 50% in the exponential reward case. 
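To make the estimators above concrete, the following is a minimal Python sketch (not the code used for the tables) of the raw and control-variable estimators, under the simplifying assumptions of a deterministic horizon N, Uniform(0, 1) rewards and unlimited new arms; the thresholds c(n) are found by bisection, as suggested after Equation (3.1), and E[S] is computed from P(S > i) = F(c_i)^(i-1). The function names are introduced here for illustration only.

```python
import numpy as np

def threshold(n, N, tol=1e-10):
    """c(n) = inf{x : x - mu - E[(X - x)^+] * E[N_n] >= 0} for Uniform(0,1) rewards
    (mu = 1/2, E[(X - x)^+] = (1 - x)^2 / 2) and a deterministic horizon (E[N_n] = N - n).
    Bisection works because the left side is strictly increasing in x."""
    m = N - n
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid - 0.5 - m * (1 - mid) ** 2 / 2 >= 0:
            hi = mid
        else:
            lo = mid
    return hi

def simulate_once(N, c, rng):
    """One run of the one-stage lookahead policy; returns (sum of rewards, stopping game S)."""
    best, total = 0.0, 0.0
    for n in range(1, N + 1):
        if n >= 2 and best >= c[n]:           # first game in which an old arm is played
            return total + best * (N - n + 1), n
        x = rng.random()                      # play a new arm
        total += x
        best = max(best, x)
    return total, N + 1                       # no old arm was played within the horizon

def estimate_V(N=100, M=10_000, seed=0):
    rng = np.random.default_rng(seed)
    c = {n: threshold(n, N) for n in range(1, N + 1)}
    # E[S]: P(S > 1) = 1 and P(S > i) = F(c_i)^(i-1) = c_i^(i-1) for i = 2, ..., N
    mu_S = 2.0 + sum(c[i] ** (i - 1) for i in range(2, N + 1))
    V = np.empty(M)
    S = np.empty(M)
    for r in range(M):
        V[r], S[r] = simulate_once(N, c, rng)
    c_hat = -np.cov(V, S)[0, 1] / S.var(ddof=1)   # estimated optimal control-variable multiplier
    return V.mean(), (V + c_hat * (S - mu_S)).mean()   # raw and control-variable estimates

print(estimate_V())   # both estimates should be close to 90.5 for N = 100 (cf. Table 3.3)
```

The same simulated pairs (V_i, S_i) can be regrouped by the observed value of S to form the post-stratification estimate as well.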
The result is somewhat surprising because it is known that proportional stratication on a variable usually has a better variance reduction performance than using that variable as a control variable. The post-stratication method usually has a similar performance as proportional stratication, so it is surprising that the performance of the control variable estimator is as good as the performance of the post-stratication estimator. 44 Chapter 4 When there are innitely many arms In this chapter, we consider the case where there are innitely many arms. We show that the results shown in the case of nitely many arms also hold when there are innitely many arms. In Section 4.1, we show that whenN follows a general distribution, the optimal policy has a threshold structure. In Section 4.2, we show that whenN is IFR, the optimal policy is the one-stage lookahead policy. 4.1 General properties of optimal policy Recall that we denote the state by (n;x;k) where n is the game about to be played, x is the best old value so far, and k is the number of remaining new arms. In this section, we allow k to be innity, as we assume there are innitely many arms. For an arbitrary policy , let V (n;x;k) denote the expected sum of rewards from state (n;x;k) onward under policy . In addition, let V (n;x;k) denote the maximum expected sum of rewards from state (n;x;k) onward. We rst show that when there are innitely many arms, the maximum expected sum of rewards V (n;x;1) increases in x, the value of the best old arm. 45 Proposition 13. V (n;x;1) increases in x. Proof. The result immediately follows from a coupling argument. Now we show that when N is general, the optimal policy has a threshold structure. Proposition 14. There exist threshold valuesc(n;1) such that the optimal action in state (n;x;1) is to play x if xc(n;1) and to play a new arm otherwise. Proof. By the optimality equation, it is optimal to play x in state (n;x;1) if x +P (N n 1)V (n + 1;x;1) +P (N n 1)E[V (n + 1; max(x;X);1)] or, equivalently, if x +P (N n 1)V (n + 1;x;1) +P (N n 1) F (x)V (n + 1;x;1) + Z 1 x V (n + 1;y;1)dF (y) or, equivalently, if x +P (N n 1) Z 1 x V (n + 1;y;1)dF (y) (1F (x))V (n + 1;x;1) which, because V (n;x;1) is an increasing function of x, is equivalent to x +P (N n 1)E (V (n + 1;X;1)V (n + 1;x;1)) + Now, since (V (n + 1;X;1)V (n + 1;x;1)) + is a decreasing function of x, it follows that the left side of the preceding equation increases in x whereas the right side decreases. Therefore, upon setting c(n;1) = inf x :x +P (N n 1)E (V (n + 1;X;1)V (n + 1;x;1)) + the result is proven. 46 4.2 When N has increasing failure rate Let 1 denote the one-stage lookahead policy, whose threshold values are c(n;k) = 8 > > > < > > > : 0; k = 0 inffx :xE [(Xx) + ]E[N n ] 0g; k = 1; 2;:::;1 (4.1) In Chapter 3 we showed that policy 1 is optimal whenN is IFR and there are nitely many arms. That is, for any state (n;x;k) such that k<1, the optimal action is to play x if xc(n;k), and to play a new arm otherwise. In this section, we show that when N is IFR and there are innitely many arms, the one-stage lookahead policy 1 is still optimal. To show the optimality of the one-stage lookahead policy 1 when there are innitely many arms, we need the following lemma. For an arbitrary threshold policy , letc (n;k) denote its threshold value when game n is about to be played and there are k remaining new arms, n = 1; 2;::: ;k = 0; 1;:::;1. Lemma 15. Let denote an arbitrary threshold policy such that c (i;j) =c (i;1);i 1;j 1. 
For any n;x, lim k!1 V (n;x;k) =V (n;x;1) Proof. Consider two processes: the rst starts from state (n;x;1) and the second starts from state (n;x;k). Suppose that both processes employ policy . Then the expected sum of rewards of the rst process is V (n;x;1), and the expected sum of rewards of the second process is V (n;x;k). Let A(i) denote the reward earned in game i of process 1, and B(i) denote the reward earned in game i of process 2, i =n;n + 1;:::. Then V (n;x;1) =E " n+Nn X i=n A(i) # V (n;x;k) =E " n+Nn X i=n B(i) # 47 Couple the two processes such that they have the same horizon. Let X j denote thej th played new arm in process 1, j = 1; 2;:::. We also couple the two processes such that the j th played new arm in process 2 also has value X j , j = 1; 2;:::;k. Note that under policy , the two processes make the same decisions, except that process 2 stops playing new arms after having played k new arms. Consequently, the sum of rewards of process 1 can exceed the sum of rewards of process 2 only if the additional horizon after game n, denoted by N n , is greater than or equal to k. Let T denote the rst game that process 2 plays a dierent arm than process 1. Then Tn +k. In addition, if T <1, then it implies that processes 1 and 2 have played k new arms by game T 1, and process 1 plays a new arm in game T . Thus, V (n;x;1)V (n;x;k) =E " n+Nn X i=n A(i) # E " n+Nn X i=n B(i) # =E " n+Nn X i=T (A(i)B(i)) # =E " n+Nn X i=T (A(i)B(i)) N n k # P (N n k) =E " N X i=T (A(i)B(i)) Nn +k # P (N n k) Now, for i =T;T + 1;:::, A(i) maxfx;X 1 ;:::;X k+iT+1 g B(i) = maxfx;X 1 ;:::;X k g 48 Thus, V (n;x;1)V (n;x;k) =E " N X i=T (A(i)B(i)) Nn +k # P (N n k) E " N X i=T (maxfx;X 1 ;:::;X k+iT+1 g maxfx;X 1 ;:::;X k g) Nn +k # P (N n k) =E " N X i=T max j=k+1;:::;k+iT+1 X j maxfx;X 1 ;:::;X k g + Nn +k # P (N n k) E " N X i=T max j=k+1;:::;k+iT+1 X j Nn +k # P (N n k) E 2 4 N X i=T k+iT+1 X j=k+1 X j Nn +k 3 5 P (N n k) In addition, since X k+1 ;X k+2 ;::: are independent to T and N, it follows that E 2 4 N X i=T k+iT+1 X j=k+1 X j Nn +k 3 5 =E " N X i=T (iT + 1) Nn +k # =E 2 4 NT X j=0 (j + 1) Nn +k 3 5 E 2 4 Nnk X j=0 (j + 1) Nn +k 3 5 =E 2 4 N n+k X j=0 (j + 1) 3 5 =E (N n+k + 1)(N n+k + 2) 2 E[N(N + 1)] 2 where the last inequality follows from the fact that N is IFR. 49 Consequently, V (n;x;1)V (n;x;k) E 2 4 N X i=T k+iT+1 X j=k+1 X j Nn +k 3 5 P (N n k) E[N(N + 1)] 2 P (N n k) Letting k!1, we have lim k!1 (V (n;x;1)V (n;x;k)) E[N(N + 1)] 2 lim k!1 P (N n k) = 0 which, together with the fact that V (n;x;k) increases in k, yields the result. Proposition 16. When N is IFR and there are innitely many arms, the one-stage lookahead policy 1 is optimal. That is, the optimal action in state (n;x;1) is to play x if x c(n;1), and to play a new arm otherwise, where the optimal threshold values c(n;1) are given in Equation (4.1). Proof. It is equivalent to show that V (n;x;1) =V 1 (n;x;1) for any n and x. Let 2 denote the optimal policy for a process starting from state (n;x;1). Then it follows from Proposition 14 that 2 is a threshold policy with threshold values c 2 (i;1);in. Without loss of generality, let c 2 (i;j) =c 2 (i;1) for in, j 1 and c 2 (i; 0) = 0 for in. Now, it follows from Lemma 15 that lim k!1 V 1 (n;x;k) =V 1 (n;x;1) lim k!1 V 2 (n;x;k) =V 2 (n;x;1) 50 Recall that whenN is IFR, policy 1 is optimal for any state (n;x;k) such that 0k<1. 
Thus, V (n;x;k) =V 1 (n;x;k)V 2 (n;x;k) Consequently, V (n;x;1) =V 2 (n;x;1) = lim k!1 V 2 (n;x;k) lim k!1 V 1 (n;x;k) =V 1 (n;x;1) which by the optimality of V (n;x;1) implies that V (n;x;1) =V 1 (n;x;1) 51 Chapter 5 Heuristic policies whenN follows a gen- eral distribution In Chapter 2, we showed that the optimal policy has a threshold structure in general. In Chapter 3, we further showed that when the horizon N is IFR, the optimal policy is a threshold stopping policy with the threshold values given by a one-stage lookahead procedure. However, when N is not IFR, the optimal policy need not be a threshold stopping policy, and the optimal threshold values are dicult to obtain. Nevertheless, the previously shown structural results are still useful for constructing heuristic policies. In this chapter, we consider the case where the horizon N has a general distribution, and propose heuristics which are easy to implement and have good empirical performances. In Section 5.1, we propose ve heuristics for the general case. The rst three heuristics are easy to implement, and most computations are performed before the games begin, which is useful when we need to make quick decisions in each game or when the system is not exible for updating its policy. The last two heuristics may require more computations in each game. All heuristics except the rst one have good empirical performances. In Section 5.2, we consider the case where the horizon N follows a mixture of geometric distributions, and propose a 52 new heuristic for this case. Finally, in Section 5.3, we evaluate the performances of the heuristics using simulation. For the purpose of simplicity, we suppose that there are innitely many arms in this chapter. For problems with nitely many arms (K arms), simply follow the policies until all K new arms have been played, and then keep playing the best old arm throughout. 5.1 When N follows a general distribution We propose ve heuristic policies for the case when N follows a general distribution. 5.1.1 Static policies A static policy is dened as a policy that plays new arms in the rst m games and then plays the best old arm in the remaining games, where m is a pre-determined integer. The policy is also referred to as the m-policy. Recall that we assume that the player keeps on playing forever, while only the rewards earned in the rst N games count towards the objective. For a given m, let V (m) denote the expected sum of rewards under the static policy that plays m new arms. Let A(n) denote the reward earned in game n, n 1. Let X j denote the value of the j th played new arm, j 1, and =E[X 1 ]. Then V (m) =E " N X n=1 A(n) # =E 2 4 minfm;Ng X n=1 A(n) 3 5 +E 2 4 N X n=minfm;Ng+1 A(n) 3 5 =E 2 4 minfm;Ng X n=1 A(n) 3 5 +E " N X n=m+1 A(n) # For n = 1; 2;:::; minfm;Ng, we have A(n) = X n , because new arms are played in these games. For n = m + 1;m + 2;:::, we have A(n) = max i=1;:::;m X i , because the best old arm is played in 53 these games. Thus, V (m) =E 2 4 minfm;Ng X n=1 A(n) 3 5 +E " N X n=m+1 A(n) # =E 2 4 minfm;Ng X n=1 X n 3 5 +E (Nm) + max i=1;:::;m X i =E [minfm;Ng] +E max i=1;:::;m X i E (Nm) + (5.1) where E max i=1;:::;m X i = Z 1 0 P max i=1;:::;m X i t dt = Z 1 0 (1 (F (t)) m )dt Remark: WhenN follows a geometric distribution, the expression ofV (m) can be further simplied. Suppose that N is geometric with parameter . 
Then E (Nm) + =E[NmjN >m]P (N >m) = 1 (1) m and E[minfm;Ng] =E N (Nm) + = 1 (1) m : Consequently, V (m) = +E [max i=1;:::;m X i ] (1) m : In practice, the optimal value (or near-optimal value) ofm can be numerically obtained by searching through the possible values ofm. Specially, whenN is IFR, we can show thatV (m) is a unimodal 54 function of m, implying that the optimal value of m can be obtained more eciently using the binary search algorithm. Proposition 17. When N is IFR, V (m) is a unimodal function of m. Proof. It suces to show that for m = 1; 2;:::, if V (m + 1)V (m), then V (m + 2)V (m + 1). Now, V (m + 1)V (m) =E [minfm + 1;Ng] +E max i=1;:::;m+1 X i E (Nm 1) + E [minfm;Ng]E max i=1;:::;m X i E (Nm) + =E max i=1;:::;m+1 X i max i=1;:::;m X i E (Nm 1) + E max i=1;:::;m X i P (Nm + 1) =E " X m+1 max i=1;:::;m X i + # E (Nm 1) + E max i=1;:::;m X i P (Nm + 1) Therefore, V (m + 1)V (m) is equivalent to E " X m+1 max i=1;:::;m X i + # E (Nm 1) + E max i=1;:::;m X i P (Nm + 1) or, equivalently, E [(Nm 1) + ] P (Nm + 1) E [max i=1;:::;m X i ] E (X m+1 max i=1;:::;m X i ) + or, equivalently, E [Nm 1jNm + 1] E [max i=1;:::;m X i ] E (X m+1 max i=1;:::;m X i ) + Therefore, it suces to show that if E [Nm 1jNm + 1] E [max i=1;:::;m X i ] E (X m+1 max i=1;:::;m X i ) + then E [Nm 2jNm + 2] E [max i=1;:::;m+1 X i ] E (X m+2 max i=1;:::;m+1 X i ) + 55 which follows as when N is IFR, E [Nm 1jNm + 1]E [Nm 2jNm + 2] and E [max i=1;:::;m X i ] E (X m+1 max i=1;:::;m X i ) + E [max i=1;:::;m+1 X i ] E (X m+2 max i=1;:::;m+1 X i ) + : Remark: When N is not IFR, V (m) need not be a unimodal function of m. For example, suppose that N = 8 > > > < > > > : 10; with probability 0:99 1000; with probability 0:01 The values of V (m) for m = 1; 2;:::; 20 are shown in the following table. m 1 2 3 4 5 6 7 8 9 10 V (m) 9.95 12.93 14.18 14.72 14.92 14.91 14.79 14.58 14.31 14.0 m 11 12 13 14 15 16 17 18 19 20 V (m) 14.07 14.13 14.18 14.22 14.26 14.29 14.32 14.34 14.36 14.38 Table 5.1: Values of V (m) for m = 1; 2;:::; 20 We observe from the table that in this example, V (m) is not a unimodal function of m. Figure 5.1 shows the value of V (m) for dierent m, with a deterministic horizon N = 100. When F is Uniform(0; 1), the best m is m = 13, and V (m ) = 87:2857. When F is Exponential(1), the 56 bestm ism = 26, andV (m ) = 311:227. We also observe that in both cases, V (m) is a unimodal function of m. Figure 5.1: Values of V (m) for dierent m, when N = 100 Figure 5.2 shows the best value of m for dierent horizon N (supposing that N is deterministic). 57 Figure 5.2: Best value of m for dierent deterministic horizon N 5.1.2 Constant threshold policies For a xed constant c, a constant threshold policy is dened as a policy that starts by playing new arms until seeing an arm with value greater than or equal to c, and then plays that arm from then on. The constantc is predetermined before the games. The policy is also referred to as thec-policy. Recall that we assume that the player keeps on playing forever, while only the rewards earned in the rst N games count towards the objective. For a given c, let V (c) denote the expected sum of rewards under the constant threshold policy with threshold value c. Let A(n) denote the reward earned in gamen,n 1. LetX j denote the value of thej th played new arm,j 1, and =E [X 1 ]. 58 LetT denote the number of new arms played until seeing a value greater than or equal to c. Then T is a geometric random variable with parameter 1F (c). 
Now, V (c) =E " N X n=1 A(n) # =E 2 4 minfT;Ng X n=1 A(n) 3 5 +E 2 4 N X n=minfT;Ng+1 A(n) 3 5 =E 2 4 minfT;Ng X n=1 A(n) 3 5 +E " N X n=T+1 A(n) # (5.2) For n = 1; 2;:::;T , we have A(n) =X n as new arms are played in these games. Thus, E 2 4 minfT;Ng X n=1 A(n) 3 5 =E 2 4 minfT;Ng X n=1 X n 3 5 In addition, sinceN is independent ofX 1 ;X 2 ;:::, andT only depends onX 1 ;X 2 ;:::;X T , it follows from Wald's equation that E 2 4 minfT;Ng X n=1 A(n) 3 5 =E 2 4 minfT;Ng X n=1 X n 3 5 =E [minfT;Ng] (5.3) On the other hand, for n =T;T + 1;:::, we haveA(n) =X T , whereX T has the same distribution as the conditional distribution of X given Xc, and the value of X T is independent to T . Thus, E " N X n=T+1 A(n) # =E (NT ) + X T =E [XjXc]E (NT ) + (5.4) Plugging Equation (5.3) and Equation (5.4) into Equation (5.2) gives V (c) =E 2 4 minfT;Ng X n=1 A(n) 3 5 +E " N X n=T+1 A(n) # =E [minfT;Ng] +E [XjXc]E (NT ) + (5.5) 59 where E [XjXc] = Z 1 c x f(x) 1F (c) dx Remark: WhenN follows a geometric distribution, the expression ofV (c) can be further simplied. Suppose that N is geometric with parameter . Let X denote the value of the rst arm played, A denote the additional sum of rewards after the rst game andR(c) denote the sum of rewards in all games, then R(c) =X +A. Now, E[A] =E[AjN > 1](1), and E[AjN > 1] =E[AjN > 1;X <c]F (c) +E[AjN > 1;Xc](1F (c)) =E[R(c)]F (c) +E[XjXc]E[N](1F (c)) Thus, V (c) =E [R(c)] = + (1)E [XjXc] (1F (c))= 1 (1)F (c) Moreover, the optimal policy (when N is geometric) is a constant threshold policy with threshold value c = inffx : x= +E[maxfx;Xg](1)=g given by one-stage lookahead (see Section 3.1 and Section 4.2). Note that under any constant threshold policy, if an arm is not re-used in the next game after its initial play, it will not be used again. Consequently, the maximum expected sum of rewards is V =V new (1; 0;1) =V new (1;c ;1) =V old (1;c ;1) =c = In practice, the optimal value (or near-optimal value) of c can be numerically obtained using searching algorithms. 60 The following two gures show the value of V (c) for dierent c, with a deterministic horizon N = 100. The rst gure assumesF is Uniform(0; 1), and the second gure assumesF is Exponential(1). When F is Uniform(0; 1), the best threshold value is c = 0:9, and V (c ) = 90:5001. When F is Exponential(1), the best threshold value is c = 3:240, and V (c ) = 342:793. Figure 5.3: Values of V (c) for dierent c, when N = 100 and F is Uniform(0; 1) 61 Figure 5.4: Values of V (c) for dierent c, when N = 100 and F is Exponential(1) Figure 5.5 shows the best value of c for dierent horizon N (supposing N is deterministic). 62 Figure 5.5: Best value of c for dierent deterministic horizon N Remark: We conjecture that whenN is IFR and there are innitely many arms,V (c) is a unimodal function of c. Figure 5.3 and Figure 5.4 provide some evidence for this conjecture when N is deterministic. 5.1.3 Constant threshold policies using a bounded number of new arms For xed constants c and m, we consider a policy that starts by playing new arms until either a value of c is reached or m new arms have been played. At that point, the policy stops playing new arms, and plays the best old arm from then on. The values of c and m are pre-determined before the games. The policy is also referred to as the bounded constant threshold policy or the (c;m)-policy. 63 Recall that we assume that the player keeps on playing forever, while only the rewards earned in the rstN games count towards the objective. 
For a given pair (c;m), letV (c;m) denote the expected sum of rewards under the bounded constant threshold policy with the threshold value c and the bound of m new arms. Let A(n) denote the reward earned in game n, n 1. Let X j denote the value of the j th played new arm, j 1, and let =E [X 1 ]. Letting T denote the number of new arms played until seeing a value greater than or equal to c, then T is a geometric random variable with parameter 1F (c). Now, V (c;m) =E " N X n=1 A(n) # =E 2 4 minfm;T;Ng X n=1 A(n) 3 5 +E 2 4 N X n=minfm;T;Ng+1 A(n) 3 5 (5.6) For n = 1; 2;:::; minfm;Tg, we have A(n) =X n as new arms are played in these games. Thus, E 2 4 minfm;T;Ng X n=1 A(n) 3 5 =E 2 4 minfm;T;Ng X n=1 X n 3 5 In addition, sinceN is independent toX 1 ;X 2 ;:::, andT only depends onX 1 ;X 2 ;:::;X T , it follows from Wald's equation that E 2 4 minfm;T;Ng X n=1 A(n) 3 5 =E 2 4 minfm;T;Ng X n=1 X n 3 5 =E [minfm;T;Ng] (5.7) On the other hand, E 2 4 N X n=minfm;T;Ng+1 A(n) 3 5 =E 2 4 N X n=minfm;Tg+1 A(n) 3 5 =E 2 4 N X n=minfm;Tg+1 A(n) Tm 3 5 P (Tm) +E 2 4 N X n=minfm;Tg+1 A(n) T >m 3 5 P (T >m) =E " N X n=T+1 A(n) Tm # P (Tm) +E " N X n=m+1 A(n) T >m # P (T >m) (5.8) Now, if given thatTm, thenA(n) =X T forn =T;T +1;:::, whereX T has the same distribution as the conditional distribution of X given that X c, and the value of X T is independent to T . 64 Consequently, E " N X n=T+1 A(n) Tm # =E (NT ) + X T jTm =E [XjXc]E (NT ) + jTm (5.9) On the other hand, if given that T > m, then A(n) = max i=1;:::;m X i for n = m + 1;m + 2;:::. Consequently, E " N X n=m+1 A(n) T >m # =E (Nm) + max i=1;:::;m X i T >m =E max i=1;:::;m X i T >m E (Nm) + =E max i=1;:::;m X i max i=1;:::;m X i <c E (Nm) + (5.10) Thus, by combining Equation (5.8), Equation (5.9) and Equation (5.10) together, we obtain that E 2 4 N X n=minfm;T;Ng+1 A(n) 3 5 =E " N X n=T+1 A(n) Tm # P (Tm) +E " N X n=m+1 A(n) T >m # P (T >m) =E [XjXc]E (NT ) + jTm P (Tm) +E max i=1;:::;m X i max i=1;:::;m X i <c E (Nm) + P (T >m) (5.11) Plugging Equation (5.7) and Equation (5.11) into Equation (5.6) gives V (c;m) =E 2 4 minfm;T;Ng X n=1 A(n) 3 5 +E 2 4 N X n=minfm;T;Ng+1 A(n) 3 5 =E [minfm;T;Ng] +E [XjXc]E (NT ) + jTm P (Tm) +E max i=1;:::;m X i max i=1;:::;m X i <c E (Nm) + P (T >m) (5.12) where E [XjXc] = Z 1 c x f(x) 1F (c) dx 65 and E max i=1;:::;m X i max i=1;:::;m X i <c = Z c 0 P max i=1;:::;m X i >x max i=1;:::;m X i <c dx = Z c 0 1 F (x) F (c) m dx In practice, the optimal values (or near-optimal values) of the pair (c;m) can be obtained numeri- cally using searching algorithms. In Section 5.1.2, we found that when N = 100 and F is Uniform(0; 1), the best c-policy uses a threshold value c = 0:9. It is usually costly to search for the best pair of (c;m). Instead, we can nd a good pair of (c;m), by xing c = c and searching for the best m corresponding to that c (see Figure 5.6). For c = 0:9, the best value of m is 47, and V (c;m) = 90:5061. Figure 5.6: V (c;m) for dierent m and xed c = 0:9, when N = 100 and F is Uniform(0; 1) For another example, when N = 100 and F is Exponential(1), the best c-policy uses a threshold 66 value c = 3:24. To nd a good pair of (c;m), we x c = 3:24 and search for the best m corre- sponding to thatc (see Figure 5.7). Forc = 3:24, the best value ofm is 64, andV (c;m) = 344:084. Figure 5.7: V (c;m) for dierent m and xed c = 3:24, when N = 100 and F is Exponential(1) Consider the case of deterministic horizon. For each value of N, we use the best c in the c-policy, and search for the best m corresponding to that c. 
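To illustrate this search, the following is a minimal Python sketch (the names V_cm and best_m_for_c are introduced here) that evaluates Equation (5.12) in closed form for the special case of a deterministic horizon N and Uniform(0, 1) rewards, where F(c) = c, E[X | X >= c] = (1 + c)/2 and E[max of m draws | max < c] = cm/(m + 1), and then scans m for a fixed c.

```python
def V_cm(c, m, N):
    """V(c, m) from Equation (5.12), specialized (for illustration) to a deterministic
    horizon N and Uniform(0,1) rewards; assumes 0 < c < 1 and 1 <= m <= N."""
    q, p = c, 1.0 - c          # q = F(c), p = P(X >= c); T is geometric with parameter p
    mu = 0.5                   # E[X]
    term1 = mu * (1.0 - q ** m) / p                            # mu * E[min{m, T, N}]
    term2 = (1.0 + c) / 2 * sum((N - t) * q ** (t - 1) * p     # E[X | X >= c] * E[(N - T)^+ ; T <= m]
                                for t in range(1, m + 1))
    term3 = c * m / (m + 1) * (N - m) * q ** m                 # E[max | max < c] * (N - m) * P(T > m)
    return term1 + term2 + term3

def best_m_for_c(c, N):
    """Scan m = 1, ..., N for the best bound on the number of new arms, for a fixed c."""
    values = {m: V_cm(c, m, N) for m in range(1, N + 1)}
    m_star = max(values, key=values.get)
    return m_star, values[m_star]

print(best_m_for_c(0.9, 100))   # roughly (47, 90.5), matching the example discussed above
```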
Figure 5.8 shows the best value of m for each horizon N. 67 Figure 5.8: Best value of m for dierent deterministic horizon N and corresponding c Finally, we compare the performances of the three heuristics for dierent deterministic horizon N. We consider two cases of the reward distribution F : Uniform(0; 1) and Exponential(1). The m-policy uses the best value of m, and the c-policy uses the best value of c. For the (c;m) policy, since it is costly to nd the best pair of (c;m), we use the same value of c as the c-policy, and use the best value of m corresponding to that c. The maximum expected sum of rewards is obtained using the ecient simulation algorithms in Chapter 3, as the one-stage lookahead policy is optimal for deterministic horizons. The simulation results are shown in Table 5.2 and Table 5.3. We observe that in all cases considered, the c-policy and the (c;m)-policy both have close-to-optimal performances. The (c;m)-policy is slightly better than the c-policy, but the dierence is not very signicant, especially when F is Uniform(0; 1). The m-policy, on the other hand, does not perform well. 68 N m-policy c-policy (c;m)-policy Maximum 10 6.800 7.366 7.375 7.389 20 15.00 16.04 16.05 16.07 50 41.40 43.43 43.44 43.45 100 87.29 90.50 90.51 90.53 200 181.4 186.4 186.4 186.4 500 469.8 478.1 478.1 478.1 1000 956.8 968.9 968.9 968.9 Table 5.2: Expected sum of rewards of the three heuristics when F is Uniform(0; 1) N m-policy c-policy (c;m)-policy Maximum 10 16.50 18.93 19.00 19.08 20 40.71 46.11 46.33 46.57 50 131.14 146.08 146.73 147.50 100 311.23 342.79 344.08 345.53 200 726.22 792.01 794.38 797.22 500 2175.30 2346.66 2351.69 2357.56 1000 4914.53 5265.41 5274.11 5281.65 Table 5.3: Expected sum of rewards of the three heuristics when F is Exponential(1) Remark: We observe that adding a bound on the number of new arms to play does not bring too much gain to the constant threshold policy, even by using the best m for the given c. However, it is worth mentioning that if the values of the rst m arms do not reach c, stopping and playing the best old arm (whose expected value is cm=(m + 1) whenF is Uniform(0; 1)) is signicantly better than 69 continuing searching for an arm reaching valuec in the remainingNm games. The reason of the gain on the expected sum of rewards not being signicant is that the event of havingm consecutive arms with values all less than c is unlikely to occur. 5.1.4 The one-stage lookahead policy In this section, we consider the one-stage lookahead policy. In Chapter 3 and Chapter 4 we showed that when N is IFR, the one-stage lookahead policy is optimal. When N is not IFR, although it need not be optimal, the one-stage lookahead policy still performs well. The main idea of the policy is as follows. In each gamen, we compare the expected sum of rewards if the best old arm x is played throughout the remaining games, namely x +xE[N n ], with the expected sum of rewards if a new armX is played in gamen and then the best old arm maxfx;Xg is played from game n + 1 onward, namely +E[maxfx;Xg]E[N n ]. The decision in game n is to play arm x if the former value is larger, and to play a new arm otherwise. In addition, because f(x) =x +xE[N n ]E[maxfx;Xg]E[N n ] =xE[(Xx) + ]E[N n ] is strictly increasing inx, it follows that the one-stage lookahead policy is a threshold policy, using c(n) = inffx :xE[(Xx) + ]E[N n ] 0g (5.13) as the threshold value in game n. Remark: When N is not IFR, the threshold values c(n) given in Equation (5.13) need not decrease in n. 
Thus, the preceding one-stage lookahead policy need not be a threshold stopping policy. 70 5.1.5 The mixture threshold policy The last policy we consider for the general case is called the mixture threshold policy. In each game n, if we are given that N n =m, then from the results in the IFR case (see Section 3.1 and Section 4.2), the optimal action is to play x if xc m and to play a new arm otherwise, where c m = inffx :xE[(Xx) + ]m 0g The mixture threshold policy uses the weighted average of the threshold values, namely c(n) = 1 X m=0 c m P (N n =m) (5.14) as the threshold value in game n. 5.2 When N follows a mixture of geometric distributions Suppose that the distribution of N, denoted by G, is a mixture of r geometric distributions G 1 ;G 2 ;:::;G r . That is, suppose that P (N =n) =G(n) = r X i=1 w i G i (n); n = 1; 2;::: whereG i (n) = (1 i ) n1 i ,i = 1; 2;:::;r, andw 1 ;w 2 ;:::;w r are non-negative constants such that P r i=1 w i = 1. The values w 1 ;w 2 ;:::;w r represent the prior weight on each geometric distribution. 71 If given that N >n, the additional horizon after game n has distribution P (Nn =kjN >n) = P (N =n +k;N >n) P (N >n) = P (N =n +k) P (N >n) = P r i=1 w i (1 i ) n+k1 i P r i=1 w i (1 i ) n = r X i=1 w i (1 i ) n P r i=1 w i (1 i ) n (1 i ) k1 i = r X i=1 w (n) i G i (k) which is a mixture of geometric distributions G 1 ;G 2 ;:::;G r with posterior weights w (n) i = w i (1 i ) n P r i=1 w i (1 i ) n Now we consider a new policy called the mixture of geometrics policy. When game n is about to be played, the remaining horizon is N (n 1)jN >n 1, which is a mixture of geometric distributions with posterior weights w (n1) 1 ;w (n1) 2 ;:::;w (n1) r . If the remaining horizon follows one of the geometric distributions, say G i , then from the results in the IFR case (see Section 3.1 and Section 4.2), the optimal policy is a constant threshold policy with threshold value c i = inffx :xE (Xx) + 1 i i 0g (5.15) where X is a random variable following distribution F . Now, with the remaining horizon having a mixture of geometric distributions, the mixture of geometrics policy uses c(n) = r X i=1 w (n1) i c i as the threshold value in game n. 72 5.3 Performance evaluation of the heuristics First, we consider the case of IFR horizon, where the one-stage lookahead policy is optimal. Ex- amples of IFR distributions considered are (i) a constant, (ii) a uniform distribution and (iii) a binomial distribution. For each case of the horizon, we also consider two cases of the reward dis- tribution: Uniform(0; 1) and Exponential(1). The simulation results (see Table 5.4 and Table 5.5) show that thec-policy, the (c;m)-policy and the mixture threshold policy all have close-to-optimal performances. Horizon m-policy c-policy (c;m)-policy Mixture threshold One-stage lookahead Constant(20) 15.00 16.04 16.05 16.07 16.07 Uniformf1; 20g 7.29 7.91 7.92 7.90 7.93 Binomial(20; 0:5) 7.60 8.22 8.22 8.24 8.24 Table 5.4: Expected sum of rewards of the heuristics when N is IFR and F is Uniform(0; 1) Horizon m-policy c-policy (c;m)-policy Mixture threshold One-stage lookahead Constant(20) 40.71 46.11 46.33 46.58 46.58 Uniformf1; 20g 18.20 21.22 21.24 21.29 21.30 Binomial(20; 0:5) 18.70 21.53 21.56 21.66 21.68 Table 5.5: Expected sum of rewards of the heuristics when N is IFR and F is Exponential(1) Second, we consider the following examples of the mixture of geometric distributions, which are special cases of DFR distributions. 
73 D 1 = 8 > > > < > > > : Geometric(0:2); with probability 0:5 Geometric(0:1); with probability 0:5 D 2 = 8 > > > > > > > < > > > > > > > : Geometric(0:4); with probability 0:25 Geometric(0:2); with probability 0:5 Geometric(0:1); with probability 0:25 An upper bound on the maximum expected sum of rewards can be obtained as follows. If given exactly which geometric distribution N follows, say G i , then from the results in Section 5.1.2, the expected sum of rewards under the optimal policy is c i = i where c i is dened in Equation (5.15). Consequently, whenN follows a mixture of geometric distributions, P r i=1 w i c i = i is an upper bound on the maximum expected sum of rewards. For each case of the horizon, we also consider two cases of the reward distribution: Uniform(0; 1) and Exponential(1). The simulation results are shown in Table 5.6 and Table 5.7. We observe that the mixture of geometrics policy has the best performance and is close to the upper bound on the optimum. The c-policy and the (c;m)-policy also perform well, slightly better than the one-stage lookahead policy and the mixture threshold policy. The m-policy, however, does not perform well. Horizon m-policy c-policy (c;m)- policy One-stage lookahead Mixture threshold Mixture geometrics Upper bound D 1 5.021 5.495 5.495 5.491 5.474 5.507 5.526 D 2 3.626 3.988 3.988 3.988 3.977 3.998 4.010 Table 5.6: Expected sum of rewards of the heuristics when N follows a mixture of geometric distributions and F is Uniform(0; 1) 74 Horizon m-policy c-policy (c;m)- policy One-stage lookahead Mixture threshold Mixture geometrics Upper bound D 1 12.28 14.67 14.67 14.67 14.59 14.70 14.80 D 2 8.604 10.31 10.31 10.29 10.25 10.36 10.41 Table 5.7: Expected sum of rewards of the heuristics when N follows a mixture of geometric distributions and F is Exponential(1) 75 Chapter 6 When N is unknown In practice, a common situation is where we do not have any information about the horizon. In such case, a policy that performs well over any horizon is desired. In this chapter, we consider the case where N is unknown and propose several probabilistic policies, which play new arms according to some exploration probabilities and play the best old arm otherwise. For the purpose of simplicity, we suppose that there are innitely many arms in this chapter. For problems with nitely many arms (K arms), simply follow the policies until all K new arms have been played, and then keep playing the best old arm throughout. 6.1 When the reward distribution is unknown We propose three choices of the exploration probabilities, which do not require the knowledge of the reward distribution. The rst choice is to use a constant exploration probability. Consider a policy that in each game plays a new arm with a constant probability, and plays the best old arm otherwise. This policy is referred to as the-policy. The idea of this policy is similar to the idea of the epsilon-greedy policy 76 for the classic bandit problem, which is known to perform well in practice. The second choice is to let the exploration probabilities decrease in time. Let (1);(2);::: be a decreasing sequence of numbers in [0; 1] such that P 1 n=1 (n) =1. In each game n, the policy plays a new arm with probability (n), and plays the best old arm otherwise. The condition P 1 n=1 (n) =1 is required to ensure that the expected number of new arms to play is not bounded when the horizon is innite. 
For example, the exploration probabilities can be chosen as(n) = 1=n, and this policy is referred to as the n-policy. The third choice is to let the exploration probabilities depend on the number of new arms played so far. Ifr new arms have been played, the policy will play a new arm in the next game with probability (r),r = 0; 1;:::. For example, the exploration probabilities can be chosen as (r) = 1=(r + 1), and this policy is referred to as the r-policy. Remark: LetN(n) denote the number of new arms played in the rstn games under ther-policy,n = 1; 2;:::. ThenfN(n);n = 1; 2;:::g is a counting process with an event denoting a play of a new arm. The inter-arrival time of the k th event, denoted by X k , is a geometric random variable with parameter p k = 1=k, k = 1; 2;:::. LetS k = P k i=1 X i denote the arrival time of the k th event. Using the results on the convolution of geometric distributions (see Chen and Guisong [69]), the distribution of S k can be derived as P (S k =j) = k Y i=1 p i X c 1 +:::+c k =jk (1p 1 ) c 1 (1p k ) c k where c 1 ;:::;c k are non-negative integers. Now, because P (N(n)k) =P (S k n), the expected number of new arms played in the rst n 77 games under the r-policy is E[N(n)] = n X k=1 P (N(n)k) = n X k=1 P (S k n) = n X k=1 n X j=1 0 @ k Y i=1 p i X c 1 ++c k =jk (1p 1 ) c 1 (1p k ) c k 1 A 6.2 When the reward distribution is known When the reward distributionF is known, we consider a policy that in each game plays the best old armx with probabilityF (x), and plays a new arm with probability 1F (x). This policy is referred to as the x-policy. Note that the exploration probability of this policy is equal to the probability that a new arm is better than the best old arm, which is a similar idea as the Thompson sampling policy for the classic stochastic bandit problem. 6.3 Performance evaluation of the heuristics We use simulation to evaluate the performances of the heuristic policies. We consider two cases of the reward distribution F : Uniform(0; 1) and Exponential(1). Case 1: F is Uniform(0; 1) Figure 6.1 compares the expected average reward per game of the -policy, the n-policy, the k-policy and the x-policy when the reward distribution is Uniform(0; 1). Table 6.1 shows the mean and variance of the sum of rewards under the four policies. The simulation results are based on 10,000 repetition runs. 78 Figure 6.1: Expected average reward per game when F is Uniform(0; 1) -policy n-policy r-policy x-policy N Mean Variance Mean Variance Mean Variance Mean Variance 10 5.2796 7.46125 6.13828 4.31907 6.19959 2.90157 6.82442 2.6407 50 31.7001 125.168 36.6208 80.6474 38.8204 25.7587 40.9213 21.0171 100 70.7611 325.744 76.9445 281.911 83.0135 60.3351 86.2822 47.8171 200 156.883 653.428 159.904 967.821 174.618 137.313 179.599 105.832 500 432.931 1218.16 415.108 4953.7 457.661 394.761 466.172 286.281 1000 907.591 1698.91 848.159 16911.9 938.456 837.27 950.925 588.407 5000 4777.5 2904.47 4391.87 299327 4856.24 4616.24 4885.08 3168.13 10000 9639.4 3475.72 8880.69 1.03537e+6 9794.09 9604.5 9835.51 6452.47 Table 6.1: Sum of rewards when F is Uniform(0; 1) 79 From the experiment results, we observe that the x-policy, which requires the knowledge of the reward distribution F , always has the best performance. On the other hand, among the three policies that do not require F , ther-policy always performs the best, and is almost as good as the x-policy when the horizon is large. 
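For reference, the four exploration rules compared in these experiments can be sketched in a few lines of Python. This is an illustrative implementation only: the constant epsilon = 0.1 and the number of repetitions are arbitrary choices made here, infinitely many arms are assumed, and the x-policy uses Uniform(0, 1) rewards so that F(best) = best.

```python
import random

def run_policy(N, explore_prob, seed=None):
    """Play N games; in each game play a new arm with probability explore_prob(n, r, best)
    and otherwise replay the best old arm. Here n is the game index, r is the number of
    new arms played so far, and best is the best old value (game 1 must play a new arm)."""
    rng = random.Random(seed)
    best, total, r = 0.0, 0.0, 0
    for n in range(1, N + 1):
        if r == 0 or rng.random() < explore_prob(n, r, best):
            x = rng.random()                 # new arm; Uniform(0,1) rewards in this sketch
            best = max(best, x)
            r += 1
            total += x
        else:
            total += best                    # replay the best old arm
    return total

# Exploration rules from Sections 6.1 and 6.2 (epsilon = 0.1 is an arbitrary choice here).
eps_policy   = lambda n, r, best: 0.1            # epsilon-policy
eps_n_policy = lambda n, r, best: 1.0 / n        # epsilon_n-policy
eps_r_policy = lambda n, r, best: 1.0 / (r + 1)  # epsilon_r-policy
x_policy     = lambda n, r, best: 1.0 - best     # x-policy: explore with probability 1 - F(best)

for name, pol in [("eps", eps_policy), ("eps_n", eps_n_policy),
                  ("eps_r", eps_r_policy), ("x", x_policy)]:
    mean = sum(run_policy(1000, pol, seed=i) for i in range(200)) / 200
    print(name, round(mean, 1))              # rough Monte Carlo means of the sum of rewards
```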
Regarding the -policy and the n-policy, we observe that the n-policy performs better when the horizon is small (less than 200 games), and the-policy performs better when the horizon is large. The reason of the poor performance of the n-policy over a large horizon is that the policy does not explore enough new arms, as the exploration probability 1=n decreases rapidly in time. On the other hand, from Table 6.1 we observe that the sum of rewards under the n-policy has a large variance. Case 2: F is Exponential(1) Figure 6.2 compares the expected average reward per game of the four policies when the reward distribution is Exponential(1). Table 6.2 shows the mean and variance of the sum of rewards under the four policies. The simulation results are based on 10,000 repetition runs. 80 Figure 6.2: Expected average reward per game when F is Exponential(1) -policy n-policy r-policy x-policy N Mean Variance Mean Variance Mean Variance Mean Variance 10 10.82 100.03 13.53 81.80 14.14 67.00 16.28 89.68 50 72.22 2581.53 91.22 2847.57 108.02 2160.32 117.13 2311.24 100 176.78 1.03e+4 199.76 1.24e+4 252.32 9444.65 267.97 9238.48 200 438.24 4.03e+4 430.30 5.30e+4 578.69 4.04e+4 606.43 3.69e+4 500 1439.45 2.39e+5 1162.87 3.49e+5 1691.76 2.67e+5 1752.75 2.30e+5 1000 3463.43 9.34e+5 2441.39 1.43e+6 3751.13 1.11e+6 3867.00 9.17e+5 5000 2.46e+4 2.26e+7 1.34e+4 3.76e+7 2.30e+4 2.90e+7 2.34e+4 2.26e+7 10000 5.57e+4 8.98e+7 2.76e+4 1.53e+8 4.96e+4 1.17e+8 5.05e+4 8.93e+7 Table 6.2: Sum of rewards when F is Exponential(1) 81 Surprisingly, we observe that the -policy has the best performance when N is large. The reason is that with the reward distribution being an unbounded exponential distribution, the other three policies play too few new arms. It is worth mentioning that although the exploration probability of the x-policy is the probability of a new arm being better than the best old arm x, the policy does not take into account the potential improvement on the best old value by playing a new arm. Thus, thex-policy may underrate the potential improvement by playing a new arm, especially when F is unbounded like exponential. Consequently, the x-policy does not play enough new arms, and thus does not perform as well as the -policy over a large horizon. 82 Chapter 7 When the reward distribution has un- known parameters In previous chapters, we considered problems with a known reward distribution, and modeled the problems as stochastic dynamic programming problems. In this chapter, we consider the case where the reward distributionF is not completely known, with the parametric form of F being specied up to a set of unknown parameters. In Section 7.1, we propose several probabilistic policies using the Bayesian method. In Section 7.2, we introduce the ducial method, and propose several ducial probabilistic policies when F is a normal distribution with known variance 2 and unknown mean. In Section 7.3, we evaluate the performances of the heuristic policies using simulation. In Section 7.4, we consider the case where N is geometric, and show that the optimal policy is a stopping policy. We also present a procedure for constructing good heuristic policies. Finally, in Section 7.5, we show that when N is deterministic, the optimal policy is a stopping policy. 83 7.1 Bayesian policies for general reward distributions In this section, we suppose that F is a general distribution with unknown parameters . 
We consider dierent cases of the horizon N, and propose several Bayesian policies, which assume a prior distribution of and make decisions based on the posterior distribution. 7.1.1 When N is unknown: the Bayesian probabilistic policy In Chapter 6, we considered the problem where N is unknown but F is completely known. We proposed the x-policy, which in each game plays a new arm with probability 1F (x), where x is the value of the best old arm so far. Simulation results showed that the x-policy had a good empirical performance. When F has unknown parameters, we also want to construct a probabilistic policy using the idea of the x-policy. However, the exploration probability 1F (x) cannot be directly obtained in this case. Instead, we use the Bayesian method to obtain a posterior distribution of the unknown parameters, and compute the exploration probabilities based on the posterior distribution. The policy is summarized as follows. 84 Policy The Bayesian probabilistic policy Input: K = number of arms, p() = prior distribution of 1: Initialization: Set x = 0 and r = 0 2: for n = 1; 2;::: do 3: if no remaining new arms then 4: Play arm x 5: else 6: Generate ^ from distribution p(jx 1 ;x 2 ;:::;x r ) 7: With probability F (xj ^ ) play arm x; otherwise play a new arm x r+1 , and update x = maxfx;x r+1 g, posterior distribution p(jx 1 ;x 2 ;:::;x r+1 ), and r =r + 1 7.1.2 When N is IFR: the Bayesian probabilistic one-stage lookahead policy In Chapter 3, we proved that when N is IFR and F is completely known, the optimal policy is the one-stage lookahead policy, which in state (n;x;k) plays arm x if x +xE[N n ] E[X] + E[maxfx;Xg]E[N n ] and plays a new arm otherwise, where X is a random variable following dis- tribution F . When F has unknown parameters , the one-stage lookahead policy cannot be directly applied, because the values of E[X] and E[maxfx;Xg] cannot be directly computed. One approach to solve this problem is to use statistics methods (for example the Bayesian method) to estimate the unknown parameters, and then apply the one-stage lookahead procedure based on the estimates. This policy is a stopping policy, because the estimates are not updated when old arms are played, and x +xE[N n ]E[X]E[maxfx;Xg]E[N n ] is increasing in n when N is IFR. The problem of this policy is that when the estimates are far away from the actual values of the parameters (which is likely to occur when only a few arms have been played), the policy may stop exploring new arms even if the best old value is not good enough. 85 A better approach is to use the one-stage lookahead procedure to construct a probabilistic policy. For each , let X() denote a random variable following distribution F (j). For a game n, let x 1 ;x 2 ;:::;x r(n) denote the old values seen before that game, and let x = maxfx 1 ;x 2 ;:::;x r(n) g. Let S n =f :x +xE[N n ]E[X()] +E maxfx;X()g E[N n ]g and P n =P S n x 1 ;:::;x r(n) Then P n is the conditional probability that stopping in game n is at least as good as continuing for exactly one more period and then stopping. We consider a policy that plays the best old arm with probability P n , and plays a new arm otherwise. When implementing this policy, we do not need to compute the value of P n . Instead, we can generate a value ^ from the conditional distribution p(jx 1 ;x 2 ;:::;x r(n) ), and play the best old arm x = maxfx 1 ;x 2 ;:::;x r(n) g if x +xE[N n ]E[X( ^ )] +E[maxfx;X( ^ )g]E[N n ] The conditional distributionp(jx 1 ;x 2 ;:::;x r(n) ) can be obtained using the Bayesian method. 
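To make the sampling and comparison step concrete, here is a minimal Python sketch of a single decision of this policy, assuming (as in Sections 7.2 and 7.3, and only for concreteness) that F is normal with known variance and that the unknown mean has a normal prior, so the posterior is available in closed form. The names posterior, expected_improvement and stop_exploring are introduced here for illustration, and remaining stands for E[N_n] (simply N - n for a deterministic horizon).

```python
import math
import random

def posterior(xs, sigma2=1.0, mu0=0.0, tau0_sq=1.0):
    """Normal-normal conjugate update: posterior mean and variance of theta given samples xs."""
    prec = 1.0 / tau0_sq + len(xs) / sigma2
    mean = (mu0 / tau0_sq + sum(xs) / sigma2) / prec
    return mean, 1.0 / prec

def expected_improvement(x, theta, sigma2):
    """E[(X - x)^+] for X ~ Normal(theta, sigma2)."""
    s = math.sqrt(sigma2)
    z = (theta - x) / s
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (theta - x) * cdf + s * pdf

def stop_exploring(x, xs, remaining, sigma2=1.0, mu0=0.0, tau0_sq=1.0, rng=random):
    """One randomized decision of the Bayesian probabilistic one-stage lookahead policy.
    x = best old value, xs = values of the arms played so far, remaining = E[N_n].
    Returns True (play the best old arm) with probability P_n, by sampling theta_hat
    from the posterior and applying the one-stage lookahead comparison."""
    m, v = posterior(xs, sigma2, mu0, tau0_sq)
    theta_hat = rng.gauss(m, math.sqrt(v))
    # Play x if x + x*E[N_n] >= E[X(theta_hat)] + E[max{x, X(theta_hat)}]*E[N_n],
    # which reduces to x >= theta_hat + E[(X(theta_hat) - x)^+] * E[N_n].
    return x >= theta_hat + expected_improvement(x, theta_hat, sigma2) * remaining
```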
The policy is summarized as follows. 86 Policy The Bayesian probabilistic one-stage lookahead policy Input: K = number of arms, p() = prior distribution of 1: Initialization: Set x = 0 and r = 0 2: for n = 1; 2;::: do 3: if no remaining new arms then 4: Play arm x 5: else 6: Generate ^ from distribution p(jx 1 ;x 2 ;:::;x r ) 7: if x +xE[N n ]E[X( ^ )] +E[maxfx;X( ^ )g]E[N n ] then 8: Play arm x 9: else 10: Play a new arm x r+1 11: Updatex = maxfx;x r+1 g, posterior distribution p(jx 1 ;x 2 ;:::;x r+1 ), andr =r + 1 7.2 Fiducial policies for normal reward distributions In the previous section, the distribution of the unknown parameters is estimated using the Bayesian method, which needs to assume a prior distribution of . The quality of the estimates may depend on the choice of the prior, especially when the estimates are based on a small number of samples. In this section, we introduce an alternative method for estimating , called the ducial method, which does not need to assume a prior distribution of . The ducial method was initially proposed and studied by Fisher in the 1930s ([70], [71], [72], [73]). According to Fisher in [71], the ducial method aims to derive \rigorous probability statements about the unknown parameters of the population from which the observational data are a random 87 sample, without the assumption of any knowledge respecting their probability distributions a pri- ori". In a review paper [74] in 1992, Zabell described the intuition of the ducial argument as \the passage from a probability assertion about a statistic (conditional on a parameter) to a probabil- ity assertion about a parameter (conditional on a statistic)". For more literature on the ducial method, see Lindley [75], Stein [76], Barnard [77], Fraser [78] and [79], and Hora and Buehler [80]. Suppose that F is a normal distribution with unknown mean and known variance 2 . Suppose that X i , i = 1; 2;:::;r are i.i.d random variables following distribution F , and we have observed valuesX i =x i ,i = 1; 2;:::;r. The ducial method gives a conditional distribution of the parameter using the following procedure. For a given , X i , i = 1; 2;:::;r are random variables such that 1 r r X i=1 X r N(0; 2 r ) Now, given the observations X i = x i , i = 1; 2;:::;r, the ducial method assumes that becomes a random variable such that 1 r r X i=1 x r N(0; 2 r ) or, equivalently, N( 1 r r X i=1 x r ; 2 r ) Using this procedure, the ducial method gives a conditional distribution of the parameter given the observations X i =x i , i = 1; 2;:::;r, without assuming a prior distribution of . Although the ducial method is not widely accepted (see [74]), it provides us a procedure to estimate the distributions of the unknown parameters, which can be useful for constructing bandit policies. In this section, we suppose that the reward distribution F is normal with unknown mean and known variance 2 , and propose several probabilistic policies using the ducial method. 88 7.2.1 When N is unknown: the ducial probabilistic policy Similar to the Bayesian probabilistic policy, we combine the ducial method with the idea of the x-policy, and construct the following ducial probabilistic policy. 
Policy The ducial probabilistic policy Input: K = number of arms 1: Initialization: Set x = 0 and r = 0 2: for n = 1; 2;::: do 3: if no remaining new arms then 4: Play arm x 5: else 6: Generate ^ from the normal distribution N( P r i=1 x r =r; 2 =r) 7: With probability F (xj ^ ) play arm x; otherwise play a new arm x r+1 , and update x = maxfx;x r+1 g and r =r + 1 7.2.2 When N is IFR: the ducial probabilistic one-stage lookahead policy We combine the ducial method with the one-stage lookahead procedure and the idea of proba- bilistic policies, and construct the following ducial probabilistic one-stage lookahead policy. 89 Policy The ducial probabilistic one-stage lookahead policy Input: K = number of arms 1: Initialization: Set x = 0 and r = 0 2: for n = 1; 2;::: do 3: if no remaining new arms then 4: Play arm x 5: else 6: Generate ^ from the normal distribution N( P r i=1 x r =r; 2 =r) 7: if x +xE[N n ]E[X( ^ )] +E[maxfx;X( ^ )g]E[N n ] then 8: Play arm x 9: else 10: Play a new arm x r+1 11: Update x = maxfx;x r+1 g and r =r + 1 Remark: The ducial method can also be applied in the classic stochastic bandit problem and the contextual bandit problem, etc. 7.3 Performance evaluation of the heuristics In this section, we use simulation to evaluate the performances of the heuristic policies. Suppose thatF is a normal distribution with known variance 1 and unknown mean. For Bayesian policies, assume that has a standard normal prior. For each case of the horizon, we consider two cases of the value of : when is close to its prior mean, and when is far away from its prior mean. 90 7.3.1 The policies when N is unknown First, we evaluate the performances of the Bayesian probabilistic policy and the ducial probabilistic policy, which are proposed for the case when the horizon N is unknown. The policies are required to perform well for any value of N. When is close to its prior mean Figure 7.1 compares the expected average reward per game of the two policies for dierent values of N, when = 0:5. We observe that in the very beginning the Bayesian probabilistic policy performs slightly better than the ducial probabilistic policy, but afterwards the ducial probabilistic policy performs much better than the Bayesian probabilistic policy. Figure 7.1: Expected average reward per game when = 0:5 91 When is far away from its prior mean Figure 7.2 compares the expected average reward per game of the two policies for dierent values of N, when = 3. We observe that the ducial probabilistic policy always outperforms the Bayesian probabilistic policy. Figure 7.2: Expected average reward per game when = 3 7.3.2 The policies when N is IFR Second, we evaluate the performances of the two policies proposed for the case when N is IFR: the Bayesian probabilistic one-stage lookahead policy and the ducial probabilistic one-stage lookahead policy. We also compare them with the Bayesian probabilistic policy and the ducial probabilistic policy. Moreover, as a benchmark, we list the maximum expected sum of rewards when the value of is known, which is an upper bound of the maximum expected sum of rewards when is unknown. 92 When is close to its prior mean The following table compares the expected sum of rewards of the policies for dierent values of deterministic N, when = 0:5. 
N Bayesian probabilistic Fiducial probabilistic Bayesian lookahead Fiducial lookahead Upper bound on maximum 10 10.51 9.26 12.12 10.93 13.02 20 24.28 24.42 29.74 28.05 31.47 50 70.03 78.14 92.62 89.75 95.81 100 149.23 173.26 212.77 208.81 216.41 200 311.63 374.53 474.18 467.50 480.05 Table 7.1: Expected sum of rewards when N is deterministic and = 0:5 The following table compares the expected sum of rewards of the policies for geometric horizon with dierent parameter , when = 0:5. Bayesian probabilistic Fiducial probabilistic Bayesian lookahead Fiducial lookahead Upper bound on maximum 0.02 72.33 82.41 96.62 93.01 99.26 0.05 26.03 27.24 31.79 29.90 33.18 Table 7.2: Expected sum of rewards when N is geometric and = 0:5 We observe that in all cases, the Bayesian probabilistic one-stage lookahead policy performs slightly better than the ducial probabilistic one-stage lookahead policy, and both of them have expected sum of rewards close to the upper bound on maximum, implying that both policies have close-to- optimal performances. The Bayesian probabilistic policy and the ducial probabilistic policy, on 93 the other hand, do not perform very well, especially the Bayesian probabilistic policy. When is far away from its prior mean The following table compares the expected sum of rewards of the policies for dierent values of deterministic N, when = 3. N Bayesian probabilistic Fiducial probabilistic Bayesian lookahead Fiducial lookahead Upper bound on maximum 10 30.86 34.23 33.71 36.06 38.01 20 62.84 74.62 74.25 78.04 81.47 50 162.13 203.12 210.54 214.66 220.87 100 332.32 423.27 452.44 457.93 466.62 200 680.46 874.45 964.43 967.08 979.58 Table 7.3: Expected sum of rewards when N is deterministic and = 3 The following table compares the expected sum of rewards of the policies for geometric horizon with dierent parameter , when = 3. Bayesian probabilistic Fiducial probabilistic Bayesian lookahead Fiducial lookahead Upper bound on maximum 0.02 164.06 207.66 214.82 218.94 224.26 0.05 63.55 77.82 79.10 79.81 83.18 Table 7.4: Expected sum of rewards when N is geometric and = 3 We observe that in all cases, the ducial probabilistic one-stage lookahead policy performs slightly better than the Bayesian probabilistic one-stage lookahead policy, and both of them have expected 94 sum of rewards close to the upper bound on maximum, implying that both policies have close-to- optimal performances. The Bayesian probabilistic policy and the ducial probabilistic policy, on the other hand, does not perform very well, especially the Bayesian probabilistic policy. 7.4 When N is geometric: structure of optimal policy Suppose that the horizonN follows a geometric distribution with parameter. Because the horizon is geometric and the information about is only updated after new arms are played, it follows that once the optimal policy plays an old arm, the policy should keep playing that arm throughout. Consequently, the optimal policy is a stopping policy. Suppose that game n is about to be played, and we have observed old values x 1 ;x 2 ;:::;x r . To determine whether the policy should stop or continue playing new arms, we rst note that the expected sum of rewards by playing the best old arm x = maxfx 1 ;x 2 ;:::;x r g throughout is x=. Thus, the optimal policy should not stop, if x= is less than the expected sum of rewards under any alternative policies. The rst alternative is to play new arms until a value ofc is reached, and from then on keep playing the best old arm. 
This policy is referred to as a constant threshold policy (or c-policy). Supposing that has a prior distribution p(), the expected sum of rewards under this policy (see Section 5.1.2) is V (c) = Z +1 1 E[X()j] + (1)E[X()j;X()c](1F (cj))= 1 (1)F (cj) p(jx 1 ;:::;x r )d wherep(jx 1 ;:::;x r ) is the posterior distribution of given the observationsx 1 ;:::;x r , andX() is a random variable following distribution F (j). The optimal policy should not stop if x=<V (c) for any c. The second alternative is to play m new arms and then play the best old arm throughout, which 95 is usually referred to as the m-stage lookahead procedure (or the static m-policy). The expected sum of rewards under this policy (see Section 5.1.1) is V (m) = Z +1 1 E[X 1 ()j] + (E[max i=1;:::;m X i ()j]E[X 1 ()j]) (1) m p(jx 1 ;:::;x r )d where X 1 ();:::;X m () are i.i.d. random variables following distribution F (j). The optimal policy should not stop if x=<V (m) for any m less than or equal to k, the number of remaining new arms. We can also come up with many other alternatives, and the optimal policy should not stop if any of them is better than stopping. It is dicult to obtain the best policy by enumerating all alternatives. However, we can construct a good heuristic policy by considering some of the alternatives, and letting the player play a new arm if any alternative is better than stopping, and stop otherwise. 7.5 When N is deterministic: structure of optimal policy In this section, we consider the case when N is deterministic, and show that the optimal policy is a stopping policy. Let the state be (n;k;x 1 ;x 2 ;:::;x r ) where n is the game about to be played, k is the number of remaining new arms, and x 1 ;x 2 ;:::;x r are the old values seen so far. Note that we always have r +k =K where K is the total number of arms. We show that once the optimal policy plays an old arm, the optimal policy should also play that arm throughout the remaining games. Equivalently, we show that if it is optimal to play the best old arm x = maxfx 1 ;x 2 ;:::;x r g in state (n;k;x 1 ;x 2 ;:::;x r ), it is also optimal to play x in state (n + 1;k;x 1 ;x 2 ;:::;x r ). For any policy , let V (n;k;x 1 ;x 2 ;:::;x r ) denote the expected sum of rewards under policy from state (n;k;x 1 ;x 2 ;:::;x r ) onward. Also, let V (n;k;x 1 ;x 2 ;:::;x r ) denote the maximum expected sum of rewards from state (n;k;x 1 ;x 2 ;:::;x r ) onward. 96 Proposition 18. Suppose that N is deterministic. For any n < N, if it is optimal to play x = maxfx 1 ;x 2 ;:::;x r g in state (n;k;x 1 ;x 2 ;:::;x r ), it is also optimal to play x in state (n + 1;k;x 1 ;x 2 ;:::;x r ). Proof. Let u = inffy :F (y) = 1g (u could be innity). If x =u, it is immediate that it is optimal to always play x. Now suppose that x < u. Consider any policy 1 that plays x in state (n;k;x 1 ;x 2 ;:::;x r ) and plays a new arm in state (n + 1;k;x 1 ;x 2 ;:::;x r ). Then V 1 (n;k;x 1 ;:::;x r ) =x +E[X] +E[V 1 (n + 2;k 1;x 1 ;:::;x r ;X)] where X is a random variable following distribution F . Let 2 be a policy that plays a new arm in state (n;k;x 1 ;x 2 ;:::;x r ), plays the best old arm in game n + 1, and from game n + 2 onward makes the same decisions as policy 1 . Then V 2 (n;k;x 1 ;:::;x r ) =E[X] +E[maxfx;Xg] +E[V 1 (n + 2;k 1;x 1 ;:::;x r ;X)] >x +E[X] +E[V 1 (n + 2;k 1;x 1 ;:::;x r ;X)] =V 1 (n;k;x 1 ;:::;x r ) implying that 1 is not optimal. 
Therefore, for any n < N, if the optimal policy plays x in state (n, k, x_1, x_2, …, x_r), the optimal policy should also play x in state (n+1, k, x_1, x_2, …, x_r).

Chapter 8

When the player is not allowed to play abandoned arms

In this chapter, we assume that the player is not allowed to play abandoned arms, where an arm is said to be abandoned if it is not reused in the game immediately following its initial play. Under this assumption, in each game the player is only allowed to play either a new arm or the arm used in the last game. Suppose that the reward distribution F is known. The problem can be modeled as a stochastic dynamic programming problem with the state being (n, x, k), where n is the game about to be played, x is the value of the arm used in the last game, and k is the number of remaining new arms.

This chapter is organized as follows. In Section 8.1, we show that when the horizon N follows a general distribution, the optimal policy has a threshold structure and the optimal threshold values c(n, k) increase in k. In Section 8.2, we show that when N is IFR, the optimal threshold values c(n, k) decrease in n and thus the optimal policy is a threshold stopping policy.

8.1 When N follows a general distribution

Let V_new(n, x, k) and V_old(n, x, k) denote the respective maximum expected sum of rewards from state (n, x, k) onward, given that a new arm or the best old arm is played in game n, and let V(n, x, k) denote the maximum expected sum of rewards from state (n, x, k) onward. Let X be a random variable following distribution F with mean μ = E[X], and let N_n denote the number of games remaining after game n, that is, N_n is distributed as N − n given that N ≥ n. Then

\[
\begin{aligned}
V(n, x, k) &= \max\{V_{\text{new}}(n, x, k),\; V_{\text{old}}(n, x, k)\},\\
V_{\text{old}}(n, x, k) &= x + V(n+1, x, k)\,P(N_n \ge 1),\\
V_{\text{new}}(n, x, k) &= \begin{cases} \mu + E[V(n+1, X, k-1)]\,P(N_n \ge 1), & \text{if } k \ge 1,\\ 0, & \text{if } k = 0. \end{cases}
\end{aligned}
\]

First, note that for any process, the maximum expected sum of rewards under the constraint that the player is not allowed to play abandoned arms is less than or equal to the maximum expected sum of rewards without that constraint. Thus, it follows from Proposition 2 that when the player is not allowed to play abandoned arms, if E[N^2] < ∞, then for any state (n, x, k), V(n, x, k) < ∞.

Proposition 19. If E[N^2] < ∞, then for any state (n, x, k), V(n, x, k) < ∞.

The following proposition shows that the maximum expected sum of rewards V(n, x, k) increases in x, the value of the best old arm, and in k, the number of remaining new arms.

Proposition 20. V(n, x, k) increases in x and k.

Proof. The result follows immediately from a coupling argument.

Now we show that the optimal policy has a threshold structure.

Proposition 21. If it is optimal to play x in state (n, x, k), it is also optimal to play x' in state (n, x', k) for any x' ≥ x.

Proof. The result is immediate, as V_new(n, x', k) = V_new(n, x, k) (playing a new arm abandons the old arm, so this value does not depend on the old value), while V_old(n, x', k) ≥ V_old(n, x, k) by Proposition 20.

Proposition 21 implies that there exist threshold values c(n, k), n ≥ 1, 0 ≤ k ≤ K, such that the optimal action in state (n, x, k) is to play x if x ≥ c(n, k) and to play a new arm otherwise. The threshold values c(n, k) are referred to as the optimal threshold values. It can be shown that the optimal threshold values c(n, k) increase in k, the number of remaining new arms.

Proposition 22. The optimal threshold values c(n, k) increase in k.

The proof of Proposition 22 is similar to the proof of Proposition 5, and is omitted here for simplicity.

Remark: The optimal threshold values c(n, k) depend on k (for k ≥ 1). For example, when N = 2 and F is Uniform(0, 1), we have c(1, 1) = 0.5 whereas c(1, 2) = 0.5625.
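To make the Remark concrete, the following Python sketch computes the thresholds c(n, k) by backward induction directly from the equations above, for a deterministic horizon N and F = Uniform(0, 1). It is a minimal illustration (the function name, grid size and threshold-extraction step are our own choices, not part of the dissertation); with N = 2 it reproduces c(1, 1) = 0.5 and c(1, 2) = 0.5625 up to the grid resolution.

```python
import numpy as np

def thresholds(N, K, grid=4001):
    """Optimal thresholds c(n, k) for a deterministic horizon N and F = Uniform(0, 1),
    when abandoned arms may not be replayed, using the Section 8.1 recursion."""
    xs = np.linspace(0.0, 1.0, grid)        # grid of old-arm values x
    mu = 0.5                                # mu = E[X] for Uniform(0, 1)

    def E_uniform(v):
        # trapezoidal rule for E[v(X)], X ~ Uniform(0, 1), on the grid xs
        dx = xs[1] - xs[0]
        return dx * (v[0] / 2 + v[1:-1].sum() + v[-1] / 2)

    V = [np.zeros_like(xs) for _ in range(K + 1)]   # V(N+1, x, k) = 0
    c = {}
    for n in range(N, 0, -1):
        p_cont = 1.0 if n < N else 0.0              # P(N_n >= 1) for deterministic N
        V_next = []
        for k in range(K + 1):
            V_old = xs + p_cont * V[k]
            if k >= 1:
                V_new = mu + p_cont * E_uniform(V[k - 1])
                keep = V_old >= V_new               # V_old is increasing in x
                c[(n, k)] = xs[np.argmax(keep)] if keep.any() else np.inf
            else:
                V_new = 0.0
                c[(n, k)] = 0.0
            V_next.append(np.maximum(V_old, V_new))
        V = V_next
    return c

c = thresholds(N=2, K=2)
print(c[(1, 1)], c[(1, 2)])   # approximately 0.5 and 0.5625
```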
8.2 When N is IFR

We show that when N is IFR, the optimal policy is a threshold stopping policy. However, before doing so, we need the following lemma.

Lemma 23. Suppose that a process starts from an arbitrary state (n_0, x_0, k_0) and employs an arbitrary threshold stopping policy {c(i, j) : i ≥ n_0, 0 ≤ j ≤ k_0}. Let A(n) denote the reward earned in game n of this process, n ≥ n_0. Then E[A(n)] increases in n.

The proof of Lemma 23 is similar to the proof of Lemma 7, and is omitted here for simplicity. With Lemma 23, we can show that the optimal policy is a threshold stopping policy.

Proposition 24. When N is IFR, if it is optimal to play x in state (n, x, k), it is also optimal to play x in state (n+1, x, k). That is, the optimal policy is a threshold stopping policy.

The proof of Proposition 24 is similar to the proof of Proposition 8, and is omitted here for simplicity.

Remark: Proposition 24 implies that the optimal threshold values satisfy c(n, k) = V_new(n, 0, k)/(1 + E[N_n]). Unlike in Chapter 3 (where the player is allowed to play abandoned arms), the optimal threshold values cannot be obtained by using the one-stage lookahead procedure, as that policy need not be optimal here. Let B denote the set of states where stopping immediately is at least as good as continuing for exactly one more period and then stopping. Then

\[
B = \{(n, x, k) : k = 0\} \cup \{(n, x, k) : k \ge 1,\; x + x\,E[N_n] \ge \mu + \mu\,E[N_n]\}
  = \{(n, x, k) : k = 0\} \cup \{(n, x, k) : k \ge 1,\; x \ge \mu\}.
\]

However, B is not a closed set of states, because by playing a new arm, a state (n, x, k) can make a transition to state (n+1, x', k−1) for any x' in the support of distribution F. Therefore, the one-stage lookahead policy need not be optimal in this case.

It is usually difficult to obtain the maximum expected sum of rewards V(n, x, k). However, in the special case where the horizon N is deterministic, the maximum expected sum of rewards can be computed recursively by

\[
V(n, x, k) = \max\{x(N-n+1),\; \mu + E[V(n+1, X, k-1)]\}, \qquad 1 \le n \le N-1,\; 1 \le k \le K,
\]

with boundary conditions

\[
V(n, x, 0) = x(N-n+1), \quad 1 \le n \le N, \qquad\text{and}\qquad V(N, x, k) = \max\{x, \mu\}, \quad 1 \le k \le K.
\]

The optimal threshold values are c(n, k) = V_new(n, 0, k)/(N − n + 1) in this case.

Chapter 9

Conclusions and future directions

This work studied multi-armed bandit problems with learned rewards. Different cases of the reward distribution F and the horizon N were considered. The structure of the optimal policy was characterized in general, and the optimal policy was given when N has increasing failure rate (IFR). When the optimal policy is difficult to obtain, high-performance and easy-to-implement heuristic policies were proposed and evaluated using simulation.

First, we considered the cases with a known reward distribution F. We showed that the optimal policy always has a threshold structure, and the optimal threshold values c(n, k) increase in k. When N is IFR, we further showed that the optimal policy is a threshold stopping policy, with the optimal threshold values given by a one-stage lookahead procedure. In addition, we derived an expression for the maximum expected sum of rewards, and presented efficient simulation algorithms using control variate and post-stratification techniques when the expression is difficult to compute analytically. The preceding results were also shown for the infinitely-many-arm problem. When N follows a general distribution, we proposed five heuristic policies: the m-policy, the c-policy, the (c, m)-policy, the one-stage lookahead policy and the mixture threshold policy. The first three policies are easy to implement, and the last four policies performed well in simulation.
Specifically, when N follows a mixture of geometric distributions, we proposed a mixture-of-geometrics policy, which performed best in this case. When N is unknown, we proposed three heuristics that do not require knowledge of F (the -policy, the n-policy and the r-policy) and one policy that requires F (the x-policy). Simulation results showed that when F is Uniform(0, 1), the x-policy had the best performance, and the r-policy performed nearly as well. Surprisingly, when F is Exponential(1), the -policy had the best performance over a long horizon; the reason is that with an unbounded reward distribution such as the exponential, the other three policies explored too few new arms. In summary, we characterized the structure of the optimal policy when N follows a known general distribution, obtained the optimal policy when N is IFR, and proposed high-performance and easy-to-implement heuristic policies when N is not IFR or the distribution of N is unknown. Obtaining the optimal threshold values in the general case, however, remains an open question.

Second, we considered the cases where F has unknown parameters. When the distribution of N is unknown, we proposed two probabilistic policies: one using the Bayesian method to estimate the distribution of the unknown parameters, and the other using the fiducial method. Simulation results showed that the fiducial probabilistic policy had the more robust performance, especially when the parameters are far away from their prior means. When N is IFR, we proposed two probabilistic one-stage lookahead policies (Bayesian and fiducial). Simulation results showed that these policies are near-optimal, as their expected sums of rewards are close to an upper bound on the maximum. Moreover, when N is geometric, we showed that the optimal policy is a stopping policy, and presented heuristics for constructing the stopping criteria. Finally, we showed that when N is deterministic, the optimal policy is a stopping policy. In the future, more theoretical results, such as theoretical guarantees for the proposed heuristics, need to be established.

Last but not least, we considered a variation of the problem where the player is not allowed to play abandoned arms. Supposing that the reward distribution F is known, we showed that when N follows a general distribution, the optimal policy has a threshold structure, and the optimal threshold values c(n, k) increase in k. In addition, we showed that when N is IFR, the optimal policy is a threshold stopping policy. However, unlike in the original problem, the one-stage lookahead policy need not be optimal. Moreover, because the threshold values c(n, k) depend on k even when N is IFR, the optimal policy in the infinitely-many-arm case is still an open question.

In the future, we are interested in extending the model to let d arms be played in each game. When F is known, the problem can be modeled as a stochastic dynamic programming problem, with the state being (n, k, x_1, x_2, …, x_d), where n is the game about to be played, k is the number of remaining new arms, and x_i is the value of the i-th best old arm, i = 1, 2, …, d. We are interested in finding the optimal policy of this problem.
Appendix A

Examples and proofs

A.1 An example of c(n, k) depending on k when N is not IFR

Suppose that the reward distribution F is known, and that the horizon N follows a known general distribution. In general, the optimal threshold values c(n, k) depend on k (for k ≥ 1). For example, suppose that the horizon N equals 1 with probability 0.9 and 3 with probability 0.1, and that F is uniformly distributed between 0 and 1. First note that from game 2 onward, the optimal policy is the one-stage lookahead policy with optimal threshold values c(2, k) = 2 − √2 ≈ 0.5858 for k ≥ 1 and c(2, 0) = 0. Now, consider the optimal threshold values in game 1. If there is one remaining new arm, it can be shown that c(1, 1) = 0.5132 and V(1, 0.5132, 1) = 0.6263. If there are two remaining new arms, consider the optimal decision in state (1, 0.5132, 2). If an old arm is played in game 1, we have V_old(1, 0.5132, 2) = V(1, 0.5132, 1) = 0.6263, because from game 2 onward at most one new arm will be played (the optimal policy will not play a new arm in game 3, because we already have an arm with value 0.5132). On the other hand, it can be shown that V_new(1, 0.5132, 2) = 0.6321 > V_old(1, 0.5132, 2), implying that c(1, 2) > 0.5132 = c(1, 1). This is an example of c(n, k) depending on k for k ≥ 1.
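As a numerical sanity check on this example, the following Python sketch evaluates the dynamic programming recursion on a grid for F = Uniform(0, 1) and the two-point horizon above. It is our own illustration (the grid resolution and helper names are arbitrary choices); it reproduces c(1, 1) ≈ 0.5132, V(1, 0.5132, 1) ≈ 0.6263 and V_new(1, 0.5132, 2) ≈ 0.6321.

```python
import numpy as np

# Horizon: N = 1 w.p. 0.9 and N = 3 w.p. 0.1, so P(N >= 2 | N >= 1) = 0.1 and
# P(N >= 3 | N >= 2) = 1.  Rewards: F = Uniform(0, 1), mu = E[X] = 0.5.
xs = np.linspace(0.0, 1.0, 5001)
mu = 0.5

def E_uniform(v):
    # trapezoidal rule for E[v(X)], X ~ Uniform(0, 1), on the grid xs
    dx = xs[1] - xs[0]
    return dx * (v[0] / 2 + v[1:-1].sum() + v[-1] / 2)

def E_best(v, x):
    # E[v(max(x, X))]: the best old value after playing one more new arm
    return E_uniform(np.interp(np.maximum(x, xs), xs, v))

# Game 3 (reached only when N = 3): play the better of the best old arm and a new arm.
V3 = {0: xs.copy(), 1: np.maximum(xs, mu), 2: np.maximum(xs, mu)}
# Game 2 (given N >= 2 the horizon is surely 3, so two games remain).
V2 = {0: 2 * xs}
for k in (1, 2):
    play_new = mu + np.array([E_best(V3[k - 1], x) for x in xs])
    V2[k] = np.maximum(xs + V3[k], play_new)

# Game 1: a further game exists with probability 0.1.
def V1_old(x, k):
    return x + 0.1 * np.interp(x, xs, V2[k])

def V1_new(x, k):
    return mu + 0.1 * E_best(V2[k - 1], x)

# c(1, 1) is the x at which playing the old arm and playing a new arm are indifferent.
gap = np.array([V1_old(x, 1) - V1_new(x, 1) for x in xs])
c11 = xs[np.argmax(gap >= 0)]
print(c11, V1_old(c11, 1), V1_old(c11, 2), V1_new(c11, 2))
# roughly 0.5132, 0.6263, 0.6263, 0.6321
```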
A.2 Proof of Lemma 11

Proof. First note that

\[
A(j) = \begin{cases} X_j, & 1 \le j \le S-1,\\ \max\{X_1, X_2, \ldots, X_{S-1}\}, & j \ge S. \end{cases}
\]

Now, fix i, i = 2, 3, …, K+1. For j ≤ i−1,

\[
\begin{aligned}
P(S = i,\, A(j) < t) &= P(\max\{X_1,\ldots,X_{i-2}\} < c_{i-1},\; \max\{X_1,\ldots,X_{i-1}\} \ge c_i,\; X_j < t)\\
&= P(X_1 < c_{i-1}, \ldots, X_{i-2} < c_{i-1},\; \max\{X_1,\ldots,X_{i-1}\} \ge c_i,\; X_j < t)\\
&= P(X_1 < c_{i-1}, \ldots, X_{i-2} < c_{i-1},\; X_j < t)\\
&\qquad - P(X_1 < c_{i-1}, \ldots, X_{i-2} < c_{i-1},\; X_j < t,\; \max\{X_1,\ldots,X_{i-1}\} < c_i)\\
&= P(X_1 < c_{i-1}, \ldots, X_{i-2} < c_{i-1},\; X_j < t) - P(X_1 < c_i, \ldots, X_{i-1} < c_i,\; X_j < t)\\
&= \begin{cases} F(c_{i-1})^{i-3}\,F(\min\{t, c_{i-1}\}) - F(c_i)^{i-2}\,F(\min\{t, c_i\}), & 1 \le j \le i-2,\\ F(c_{i-1})^{i-2}\,F(t) - F(c_i)^{i-2}\,F(\min\{t, c_i\}), & j = i-1. \end{cases}
\end{aligned}
\tag{A.1}
\]

For j ≥ i,

\[
\begin{aligned}
P(S = i,\, A(j) < t) &= P(S = i,\; \max\{X_1,\ldots,X_{i-1}\} < t)\\
&= P(\max\{X_1,\ldots,X_{i-2}\} < c_{i-1},\; \max\{X_1,\ldots,X_{i-1}\} \ge c_i,\; \max\{X_1,\ldots,X_{i-1}\} < t)\\
&= P(\max\{X_1,\ldots,X_{i-2}\} < c_{i-1},\; \max\{X_1,\ldots,X_{i-1}\} < t)\\
&\qquad - P(\max\{X_1,\ldots,X_{i-2}\} < c_{i-1},\; \max\{X_1,\ldots,X_{i-1}\} < t,\; \max\{X_1,\ldots,X_{i-1}\} < c_i)\\
&= P(X_1 < \min\{t, c_{i-1}\}, \ldots, X_{i-2} < \min\{t, c_{i-1}\},\; X_{i-1} < t)\\
&\qquad - P(X_1 < \min\{t, c_i\}, \ldots, X_{i-1} < \min\{t, c_i\})\\
&= F(\min\{t, c_{i-1}\})^{i-2}\,F(t) - F(\min\{t, c_i\})^{i-1}.
\end{aligned}
\tag{A.2}
\]

Combining (A.1) and (A.2) yields the result.
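The joint distribution in (A.1) and (A.2) is also easy to check by simulation. The following Python sketch is our own illustration: it assumes F = Uniform(0, 1), an arbitrary decreasing threshold sequence with c_{K+1} = 0 (so the process stops by game K+1), and S equal to the first game i ≥ 2 in which max{X_1, …, X_{i−1}} ≥ c_i, as in the events used in the proof above. It then compares empirical estimates of P(S = i, A(j) < t) against the closed form for a few values of i ≥ 3, j and t.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
c = {2: 0.8, 3: 0.7, 4: 0.55, 5: 0.0}      # decreasing thresholds; c_{K+1} = 0
F = lambda y: min(max(y, 0.0), 1.0)        # cdf of Uniform(0, 1)

def simulate(n_runs=200_000):
    """Samples of (S, (A(1), ..., A(K+1))) under the threshold stopping process."""
    samples = []
    for _ in range(n_runs):
        X = rng.random(K)                  # new-arm values X_1, ..., X_K
        best, S = 0.0, K + 1
        for i in range(2, K + 2):          # S: first game i with max(X_1..X_{i-1}) >= c_i
            best = max(best, X[i - 2])
            if best >= c[i]:
                S = i
                break
        A = [X[j - 1] if j <= S - 1 else best for j in range(1, K + 2)]
        samples.append((S, A))
    return samples

def closed_form(i, j, t):
    """P(S = i, A(j) < t) from (A.1) and (A.2), written out for i >= 3."""
    if j <= i - 2:
        return F(c[i - 1]) ** (i - 3) * F(min(t, c[i - 1])) - F(c[i]) ** (i - 2) * F(min(t, c[i]))
    if j == i - 1:
        return F(c[i - 1]) ** (i - 2) * F(t) - F(c[i]) ** (i - 2) * F(min(t, c[i]))
    return F(min(t, c[i - 1])) ** (i - 2) * F(t) - F(min(t, c[i])) ** (i - 1)

samples = simulate()
for i, j, t in [(3, 1, 0.6), (3, 2, 0.6), (4, 4, 0.75)]:
    emp = np.mean([(S == i) and (A[j - 1] < t) for S, A in samples])
    print(i, j, t, round(float(emp), 4), round(closed_form(i, j, t), 4))
# each empirical estimate should be close to the corresponding closed-form value
```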