• DocumentCode
    3744290
  • Title

    A restless bandit with no observable states for recommendation systems and communication link scheduling

  • Author

    Rahul Meshram;D. Manjunath;Aditya Gopalan

  • Author_Institution
    Deptt. of Elecl. Engg., IIT Bombay, Mumbai INDIA
  • fYear
    2015
  • Firstpage
    7820
  • Lastpage
    7825
  • Abstract
    A restless bandit is used to model a user´s interest in a topic or item. The interest evolves as a Markov chain whose transition probabilities depend on the action (display the ad or desist) in a time step. A unit reward is obtained if the ad is displayed and if the user clicks on the ad. If no ad is displayed then a fixed reward is assumed. The probability of click-through is determined by the state of the Markov chain. The recommender never gets to observe the state but in each time step it has a belief, denoted by πt, about the state of the Markov chain. πt evolves as a function of the action and the signal from each state. For the one-armed restless bandit with two states, we characterize the policy that maximizes the infinite horizon discounted reward. We first characterize the value function as a function of the system parameters and then characterize the optimal policies for different ranges of the parameters. We will see that the Gilbert-Elliot channel in which the two states have different success probabilities becomes a special case. For one special case, we argue that the optimal policy is of the threshold type with one threshold; extensive numerical results indicate that this may be true in general.
  • Keywords
    "Markov processes","History","Transmitters","Linear programming","Process control","Analytical models","Context modeling"
  • Publisher
    ieee
  • Conference_Titel
    Decision and Control (CDC), 2015 IEEE 54th Annual Conference on
  • Type

    conf

  • DOI
    10.1109/CDC.2015.7403456
  • Filename
    7403456