Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part II: Markovian rewards

Author

Anantharam, Venkatachalam ; Varaiya, Pravin ; Walrand, Jean

Author_Institution

Cornell University, Ithaca, NY, USA

Volume

32

Issue

11

fYear

1987

fDate

11/1/1987 12:00:00 AM

Firstpage

977

Lastpage

982

Abstract

At each instant of time we are required to sample a fixed number $m \geq 1$ out of $N$ Markov chains whose stationary transition probability matrices belong to a family suitably parameterized by a real number $\theta$. The objective is to maximize the long-run expected value of the samples. The learning loss of a sampling scheme corresponding to a parameter configuration $C = (\theta_{1}, \ldots, \theta_{N})$ is quantified by the regret $R_{n}(C)$. This is the difference between the maximum expected reward that could be achieved if $C$ were known and the expected reward actually achieved. We provide a lower bound for the regret associated with any uniformly good scheme, and construct a sampling scheme which attains the lower bound for every $C$. The lower bound is given explicitly in terms of the Kullback-Leibler number between pairs of transition probabilities.
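
The abstract does not spell out the Kullback-Leibler number, so the following is only a minimal sketch under an assumed, commonly used definition for finite-state Markov chains: the per-row divergence between two transition matrices, weighted by the stationary distribution of the first. The function names `stationary_distribution` and `kl_number` are hypothetical, and the chains are assumed irreducible with `Q[x][y] > 0` wherever `P[x][y] > 0`.

```python
import math


def stationary_distribution(P, iters=5000):
    """Approximate the stationary distribution of transition matrix P.

    Iterates the lazy chain (I + P) / 2, which has the same stationary
    distribution as P and also converges when P is periodic.
    Assumes P is irreducible (an assumption of this sketch).
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
        pi = [0.5 * (a + b) for a, b in zip(pi, step)]
    return pi


def kl_number(P, Q):
    """Assumed definition: sum_x pi_P(x) * sum_y P[x][y] * log(P[x][y] / Q[x][y]).

    Terms with P[x][y] == 0 contribute nothing; Q must be positive
    wherever P is positive.
    """
    pi = stationary_distribution(P)
    total = 0.0
    for x, row in enumerate(P):
        for y, p in enumerate(row):
            if p > 0.0:
                total += pi[x] * p * math.log(p / Q[x][y])
    return total


if __name__ == "__main__":
    # Two illustrative two-state chains (made-up numbers, not from the paper).
    P = [[0.9, 0.1], [0.2, 0.8]]
    Q = [[0.5, 0.5], [0.5, 0.5]]
    print(kl_number(P, Q))
```

Under this definition the number is zero when the two matrices coincide and positive otherwise, which is the property that lets it measure how hard two parameter values are to tell apart from samples.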

Keywords

Adaptive control; Markov processes; Optimal stochastic control; Resource management; Stochastic optimal control; Arm; Computer science; Laboratories; Probability distribution; Random variables; Sampling methods; State-space methods; Statistical distributions; Statistics; Stochastic processes

fLanguage

English

Journal_Title

Automatic Control, IEEE Transactions on

Publisher

IEEE

ISSN

0018-9286

Type

jour

DOI

10.1109/TAC.1987.1104485

Filename

1104485