DocumentCode :
913562
Title :
The two-armed-bandit problem with time-invariant finite memory
Author :
Cover, Thomas M. ; Hellman, Martin E.
Volume :
16
Issue :
2
fYear :
1970
fDate :
3/1/1970
Firstpage :
185
Lastpage :
195
Abstract :
This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below. Given are probability densities $p_0$ and $p_1$, and two experiments $A$ and $B$. It is not known which density is associated with which experiment. Thus the experimental outcome $Y$ of experiment $A$ is as likely to be distributed according to $p_0$ as it is to be distributed according to $p_1$. It is desired to sequentially choose an experiment to be performed on the basis of past observations according to the algorithm $T_n = f(T_{n-1}, e_n, Y_n)$, $e_n = e(T_{n-1})$, where $T_n \in \{1, 2, \cdots, m\}$ is the state of memory at time $n$, $e_n \in \{A, B\}$ is the choice of experiment, and $Y_n$ is the random-variable observation. The goal is to maximize the asymptotic proportion $r$ of uses of the experiment associated with density $p_0$. Let $l(y) = p_0(y)/p_1(y)$, and let $\bar{l}$ and $\bar{\bar{l}}$ denote the almost-everywhere greatest lower bound and least upper bound on $l(y)$. Let $l = \max\{\bar{\bar{l}}, 1/\bar{l}\}$. Then the optimal value of $r$, over all $m$-state algorithms $(f, e)$, will be shown to be $l^{m-1}/(l^{m-1} + 1)$. An $\epsilon$-optimal family of $m$-state algorithms will be demonstrated. In general, optimal algorithms do not exist, and $\epsilon$-optimal algorithms require artificial randomization.
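To make the memory model concrete, the following is a minimal Python sketch, not the paper's ε-optimal construction: it simulates one simple $m$-state rule $T_n = f(T_{n-1}, e_n, Y_n)$ on assumed Bernoulli densities and compares the simulated proportion $r$ with the bound $l^{m-1}/(l^{m-1}+1)$. The densities, the value of $m$, and the heuristic rules $e$ and $f$ are all assumptions chosen for the demo; consistent with the abstract's remark that ε-optimal algorithms require artificial randomization, this deterministic rule falls short of the bound.

import random

# Illustrative sketch only (assumed parameters, not the paper's scheme):
# an m-state, time-invariant memory rule T_n = f(T_{n-1}, e_n, Y_n) with
# e_n = e(T_{n-1}), for two Bernoulli densities p0 and p1.

m = 4                         # number of memory states {1, ..., m} (assumed)
p0 = {1: 0.7, 0: 0.3}         # density of the "good" experiment (assumed)
p1 = {1: 0.3, 0: 0.7}         # density of the other experiment (assumed)

def likelihood_ratio(y):
    return p0[y] / p1[y]      # l(y) = p0(y) / p1(y)

def e(t):
    # Experiment-selection rule: low states play A, high states play B.
    return 'A' if t <= m // 2 else 'B'

def f(t, exp, y):
    # Heuristic memory update: step toward the end of the state chain that
    # keeps playing the current experiment when the observation supports p0,
    # and away from it otherwise; clamp to {1, ..., m}.
    toward = -1 if exp == 'A' else +1
    step = toward if likelihood_ratio(y) > 1 else -toward
    return min(max(t + step, 1), m)

def simulate(n_steps=200_000, good='A', seed=0):
    # Run the automaton; 'good' names the experiment whose density is p0.
    rng = random.Random(seed)
    t, good_pulls = 1, 0
    for _ in range(n_steps):
        exp = e(t)
        good_pulls += (exp == good)
        dens = p0 if exp == good else p1
        y = 1 if rng.random() < dens[1] else 0
        t = f(t, exp, y)
    return good_pulls / n_steps

# The paper's bound on the best achievable proportion r over all m-state
# algorithms, with l = max{sup l(y), 1/inf l(y)}; here l = 7/3.
l = max(likelihood_ratio(1), 1 / likelihood_ratio(0))
print("simulated r:", simulate())
print("optimal   r:", l**(m - 1) / (l**(m - 1) + 1))

With these assumed values ($m = 4$, $l = 7/3$) the bound is about 0.927, while the heuristic's stationary proportion works out to roughly 0.845, illustrating the gap left by a simple deterministic rule.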
Keywords :
Finite-memory methods; Learning systems; Sequential decision procedures; Stochastic automata; Bayesian methods; Probability distribution; Random variables; Statistics; Testing; Upper bound
fLanguage :
English
Journal_Title :
IEEE Transactions on Information Theory
Publisher :
IEEE
ISSN :
0018-9448
Type :
jour
DOI :
10.1109/TIT.1970.1054427
Filename :
1054427