DocumentCode
854438
Title
Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part I: I.I.D. rewards
Author
Anantharam, Venkatachalam ; Varaiya, Pravin ; Walrand, Jean
Author_Institution
Cornell University, Ithaca, NY, USA
Volume
32
Issue
11
fYear
1987
fDate
11/1/1987 12:00:00 AM
Firstpage
968
Lastpage
976
Abstract
At each instant of time we are required to sample a fixed number $m \ge 1$ out of $N$ i.i.d. processes whose distributions belong to a family suitably parameterized by a real number $\theta$. The objective is to maximize the long run total expected value of the samples. Following Lai and Robbins, the learning loss of a sampling scheme corresponding to a configuration of parameters $C = (\theta_1, \ldots, \theta_N)$ is quantified by the regret $R_n(C)$. This is the difference between the maximum expected reward at time $n$ that could be achieved if $C$ were known and the expected reward actually obtained by the sampling scheme. We provide a lower bound for the regret associated with any uniformly good scheme, and construct a scheme which attains the lower bound for every configuration $C$. The lower bound is given explicitly in terms of the Kullback-Leibler number between pairs of distributions. Part II of this paper considers the same problem when the reward processes are Markovian.
Keywords
Adaptive control; Optimal stochastic control; Resource management; Stochastic optimal control; Arm; Computer aided manufacturing; Computer science; Density measurement; Laboratories; Manufacturing systems; Sampling methods; State-space methods; Statistics
fLanguage
English
Journal_Title
Automatic Control, IEEE Transactions on
Publisher
IEEE
ISSN
0018-9286
Type
jour
DOI
10.1109/TAC.1987.1104491
Filename
1104491