• DocumentCode
    854438
  • Title

    Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: I.I.D. rewards

  • Author

    Anantharam, Venkatachalam ; Varaiya, Pravin ; Walrand, Jean

  • Author_Institution
    Cornell University, Ithaca, NY, USA
  • Volume
    32
  • Issue
    11
  • fYear
    1987
  • fDate
    11/1/1987 12:00:00 AM
  • Firstpage
    968
  • Lastpage
    976
  • Abstract
    At each instant of time we are required to sample a fixed number $m \geq 1$ out of $N$ i.i.d. processes whose distributions belong to a family suitably parameterized by a real number $\theta$. The objective is to maximize the long-run total expected value of the samples. Following Lai and Robbins, the learning loss of a sampling scheme corresponding to a configuration of parameters $C = (\theta_{1}, \ldots, \theta_{N})$ is quantified by the regret $R_{n}(C)$. This is the difference between the maximum expected reward at time $n$ that could be achieved if $C$ were known and the expected reward actually obtained by the sampling scheme. We provide a lower bound for the regret associated with any uniformly good scheme, and construct a scheme which attains the lower bound for every configuration $C$. The lower bound is given explicitly in terms of the Kullback-Leibler number between pairs of distributions. Part II of this paper considers the same problem when the reward processes are Markovian.
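The regret described in the abstract can be illustrated with a minimal sketch. It assumes each arm's mean reward is known to the evaluator and that the scheme's expected number of pulls per arm up to time n is supplied; the function name `regret` and its parameters are hypothetical and do not come from the paper. It relies on the standard identity that, for i.i.d. rewards, the expected total reward equals the sum over arms of the mean times the expected number of pulls.

```python
from typing import Sequence


def regret(means: Sequence[float], m: int,
           expected_pulls: Sequence[float], n: int) -> float:
    """Illustrative regret R_n(C) for m plays per step out of N arms.

    means          -- assumed mean reward of each arm (hypothetical input)
    m              -- number of arms sampled at each time instant
    expected_pulls -- assumed expected number of times each arm was sampled
    n              -- time horizon
    """
    if not 1 <= m <= len(means):
        raise ValueError("m must satisfy 1 <= m <= N")
    if len(expected_pulls) != len(means):
        raise ValueError("expected_pulls must have one entry per arm")
    # Best achievable per-step reward if C were known: play the m best arms.
    best_per_step = sum(sorted(means, reverse=True)[:m])
    # Expected reward actually obtained by the sampling scheme.
    obtained = sum(mu * t for mu, t in zip(means, expected_pulls))
    return n * best_per_step - obtained
```

For instance, with means (0.9, 0.5, 0.2), m = 2, and n = 100, a scheme that spends 10 of its 200 samples on the worst arm instead of the second-best one incurs a regret of 10 × (0.5 − 0.2).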
  • Keywords
    Adaptive control; Optimal stochastic control; Resource management; Stochastic optimal control; Arm; Computer aided manufacturing; Computer science; Density measurement; Laboratories; Manufacturing systems; Sampling methods; State-space methods; Statistics
  • fLanguage
    English
  • Journal_Title
    Automatic Control, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9286
  • Type

    jour

  • DOI
    10.1109/TAC.1987.1104491
  • Filename
    1104491