Adaptive learning of uncontrolled restless bandits with logarithmic regret

Author

Tekin, Cem ; Liu, Mingyan

Author_Institution

Dept. of Electr. Eng. & Comput. Sci., Univ. of Michigan, Ann Arbor, MI, USA

fYear

2011

fDate

28-30 Sept. 2011

Firstpage

983

Lastpage

990

Abstract

In this paper we consider the problem of learning the optimal policy for the uncontrolled restless bandit problem. In this problem only the state of the selected arm can be observed, the state transitions are independent of control and the transition law is unknown. We propose a learning algorithm which gives logarithmic regret uniformly over time with respect to the optimal finite horizon policy with known transition law under some assumptions on the transition probabilities of the arms and the structure of the optimal stationary policy for the infinite horizon average reward problem.

Keywords

Markov processes; computational complexity; learning (artificial intelligence); adaptive learning algorithm; infinite horizon average reward problem; logarithmic regret; optimal finite horizon policy; optimal stationary policy; state transition; transition law; transition probability; uncontrolled restless bandit problem; Equations; History; Indexes; Markov processes; Mathematical model; Upper bound; Vectors;

fLanguage

English

Publisher

ieee

Conference_Titel

Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on

Conference_Location

Monticello, IL

Print_ISBN

978-1-4577-1817-5

Type

conf

DOI

10.1109/Allerton.2011.6120273

Filename

6120273