DocumentCode :
1942554
Title :
Online learning in opportunistic spectrum access: A restless bandit approach
Author :
Tekin, Cem ; Liu, Mingyan
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Univ. of Michigan, Ann Arbor, MI, USA
fYear :
2011
fDate :
10-15 April 2011
Firstpage :
2462
Lastpage :
2470
Abstract :
We consider an opportunistic spectrum access (OSA) problem where the time-varying condition of each channel (e.g., as a result of random fading or certain primary users' activities) is modeled as an arbitrary finite-state Markov chain. At each instance of time, a (secondary) user probes a channel and collects a certain reward as a function of the state of the channel (e.g., a good channel condition results in a higher data rate for the user). Each channel has a potentially different state space and statistics, both unknown to the user, who tries to learn which channel is the best as it goes and to maximize its usage of that channel. The objective is to construct a good online learning algorithm so as to minimize the regret: the difference between the user's total reward and the reward it would have obtained by always using the (on average) best channel, had it known which one is the best from a priori knowledge of the channel statistics. This is a classic exploration-exploitation problem, and results abound when the reward processes are assumed to be i.i.d. Compared to prior work, the biggest difference is that in our case the reward process is assumed to be Markovian, of which i.i.d. is a special case. In addition, the reward processes are restless in that the channel conditions continue to evolve independently of the user's actions. This leads to a restless bandit problem, for which there exist few results on either algorithms or performance bounds in this learning context, to the best of our knowledge. In this paper we introduce an algorithm that utilizes regenerative cycles of a Markov chain and computes a sample-mean-based index policy, and show that under mild conditions on the state transition probabilities of the Markov chains this algorithm achieves logarithmic regret uniformly over time, and that this regret bound is also optimal. We numerically examine the performance of this algorithm along with a few other learning algorithms in the case of an OSA problem with Gilbert-Elliot channel models, and discuss how this algorithm may be further improved (in terms of its constant) and how this result may lead to similar bounds for other algorithms.
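Illustrative sketch :
The following Python sketch is a minimal illustration of the general idea described in the abstract, not the authors' exact RCA algorithm: a sample-mean UCB-style index is recomputed only when the played channel returns to its regeneration state, over restless Gilbert-Elliot channels whose reward equals the channel state. The channel parameters, the exploration constant L, and the horizon are all illustrative assumptions.

import math
import numpy as np

class GilbertElliotChannel:
    """Two-state Markov channel: state 1 ('good') pays 1, state 0 pays 0."""
    def __init__(self, p01, p10, rng):
        self.p01 = p01              # P(bad -> good); illustrative parameter
        self.p10 = p10              # P(good -> bad); illustrative parameter
        self.rng = rng
        self.state = int(rng.integers(2))

    def step(self):
        p_good = self.p01 if self.state == 0 else 1.0 - self.p10
        self.state = int(self.rng.random() < p_good)

def run(channels, horizon, L=2.0):
    K = len(channels)
    regen = [ch.state for ch in channels]   # per-arm regeneration state
    reward_sum = np.zeros(K)                # reward collected on each arm
    slot_sum = np.zeros(K)                  # slots spent on each arm
    blocks = np.zeros(K)                    # completed regenerative blocks
    total, arm, b = 0.0, 0, 0
    for _ in range(horizon):
        for ch in channels:                 # restless: every channel evolves
            ch.step()                       # in every slot, played or not
        r = float(channels[arm].state)      # observe only the played arm
        reward_sum[arm] += r
        slot_sum[arm] += 1
        total += r
        # A block ends when the played arm re-enters its regeneration state.
        if channels[arm].state == regen[arm]:
            blocks[arm] += 1
            b += 1
            if np.all(blocks > 0):
                mean = reward_sum / slot_sum             # sample-mean reward
                bonus = np.sqrt(L * math.log(b) / blocks)
                arm = int(np.argmax(mean + bonus))       # index policy
            else:
                arm = int(np.argmin(blocks))  # initialize: each arm once
    return total

rng = np.random.default_rng(0)
channels = [GilbertElliotChannel(0.30, 0.70, rng),   # mostly bad channel
            GilbertElliotChannel(0.60, 0.20, rng)]   # mostly good channel
print(run(channels, horizon=10000))

Updating the index only at regenerative epochs is what lets the sample means behave as if drawn from i.i.d. blocks, which is the key device behind the logarithmic regret bound; the paper's actual algorithm structures each play into sub-blocks around the regenerative cycle, which this sketch omits.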
Keywords :
Internet; radio spectrum management; statistical analysis; time-varying channels; Gilbert-Elliot channel models; OSA problem; channel statistics; finite-state Markov chain; online learning; opportunistic spectrum access; restless bandit approach; time-varying channel; Eigenvalues and eigenfunctions; Fading; Indexes; Markov processes; Probability
fLanguage :
English
Publisher :
ieee
Conference_Titel :
INFOCOM, 2011 Proceedings IEEE
Conference_Location :
Shanghai
ISSN :
0743-166X
Print_ISBN :
978-1-4244-9919-9
Type :
conf
DOI :
10.1109/INFCOM.2011.5935068
Filename :
5935068