DocumentCode :
913562
Title :
The two-armed-bandit problem with time-invariant finite memory
Author :
Cover, Thomas M. ; Hellman, Martin E.
Volume :
16
Issue :
2
fYear :
1970
fDate :
3/1/1970
Firstpage :
185
Lastpage :
195
Abstract :
This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below. Given are probability densities $p_0$ and $p_1$, and two experiments $A$ and $B$. It is not known which density is associated with which experiment. Thus the experimental outcome $Y$ of experiment $A$ is as likely to be distributed according to $p_0$ as it is to be distributed according to $p_1$. It is desired to sequentially choose an experiment to be performed on the basis of past observations according to the algorithm $T_n = f(T_{n-1}, e_n, Y_n)$, $e_n = e(T_{n-1})$, where $T_n \in \{1, 2, \cdots, m\}$ is the state of memory at time $n$, $e_n \in \{A, B\}$ is the choice of experiment, and $Y_n$ is the random-variable observation. The goal is to maximize the asymptotic proportion $r$ of uses of the experiment associated with density $p_0$. Let $l(y) = p_0(y)/p_1(y)$, and let $\bar{l}$ and $\bar{\bar{l}}$ denote the almost-everywhere greatest lower bound and least upper bound on $l(y)$. Let $l = \max\{\bar{\bar{l}}, 1/\bar{l}\}$. Then the optimal value of $r$, over all $m$-state algorithms $(f, e)$, will be shown to be $l^{m-1}/(l^{m-1} + 1)$. An $\epsilon$-optimal family of $m$-state algorithms will be demonstrated. In general, optimal algorithms do not exist, and $\epsilon$-optimal algorithms require artificial randomization.
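To make the memory model concrete, the following is a minimal Python sketch, not the paper's ε-optimal construction: it simulates one simple $m$-state rule $T_n = f(T_{n-1}, e_n, Y_n)$ on assumed Bernoulli densities and compares the simulated proportion $r$ with the bound $l^{m-1}/(l^{m-1}+1)$. The densities, the value of $m$, and the heuristic rules $e$ and $f$ are all assumptions chosen for the demo; consistent with the abstract's remark that ε-optimal algorithms require artificial randomization, this deterministic rule falls short of the bound.

import random

# Illustrative sketch only (assumed parameters, not the paper's scheme):
# an m-state, time-invariant memory rule T_n = f(T_{n-1}, e_n, Y_n) with
# e_n = e(T_{n-1}), for two Bernoulli densities p0 and p1.

m = 4                         # number of memory states {1, ..., m} (assumed)
p0 = {1: 0.7, 0: 0.3}         # density of the "good" experiment (assumed)
p1 = {1: 0.3, 0: 0.7}         # density of the other experiment (assumed)

def likelihood_ratio(y):
    return p0[y] / p1[y]      # l(y) = p0(y) / p1(y)

def e(t):
    # Experiment-selection rule: low states play A, high states play B.
    return 'A' if t <= m // 2 else 'B'

def f(t, exp, y):
    # Heuristic memory update: step toward the end of the state chain that
    # keeps playing the current experiment when the observation supports p0,
    # and away from it otherwise; clamp to {1, ..., m}.
    toward = -1 if exp == 'A' else +1
    step = toward if likelihood_ratio(y) > 1 else -toward
    return min(max(t + step, 1), m)

def simulate(n_steps=200_000, good='A', seed=0):
    # Run the automaton; 'good' names the experiment whose density is p0.
    rng = random.Random(seed)
    t, good_pulls = 1, 0
    for _ in range(n_steps):
        exp = e(t)
        good_pulls += (exp == good)
        dens = p0 if exp == good else p1
        y = 1 if rng.random() < dens[1] else 0
        t = f(t, exp, y)
    return good_pulls / n_steps

# The paper's bound on the best achievable proportion r over all m-state
# algorithms, with l = max{sup l(y), 1/inf l(y)}; here l = 7/3.
l = max(likelihood_ratio(1), 1 / likelihood_ratio(0))
print("simulated r:", simulate())
print("optimal   r:", l**(m - 1) / (l**(m - 1) + 1))

With these assumed values ($m = 4$, $l = 7/3$) the bound is about 0.927, while the heuristic's stationary proportion works out to roughly 0.845, illustrating the gap left by a simple deterministic rule.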
Keywords :
Finite-memory methods; Learning systems; Sequential decision procedures; Stochastic automata; Bayesian methods; Probability distribution; Random variables; Statistics; Testing; Upper bound
fLanguage :
English
Journal_Title :
IEEE Transactions on Information Theory
Publisher :
IEEE
ISSN :
0018-9448
Type :
jour
DOI :
10.1109/TIT.1970.1054427
Filename :
1054427