• DocumentCode
    3763563
  • Title
    Maximizing the average reward in episodic reinforcement learning tasks
  • Author
    Chris Reinke;Eiji Uchibe;Kenji Doya
  • Author_Institution
    Okinawa Institute of Science and Technology, Neural Computation Unit, Onna-son, Japan
  • fYear
    2015
  • Firstpage
    420
  • Lastpage
    421
  • Abstract
    We propose an ensemble method consisting of several Q-learning modules to optimize the average reward in episodic Markov decision processes (MDPs). The method provably learns and optimizes the average reward in MDPs where non-zero rewards are given only for transitions into goal states and where the choice of a trajectory to a goal state can be made only in the start state. We introduce a sampling method for MDPs to show that the average reward can also be optimized to a high degree in MDPs that do not fulfill these conditions.
  • Keywords
    "Trajectory","Learning (artificial intelligence)","Robots","Human computer interaction","Markov processes","Sampling methods"
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Informatics and Biomedical Sciences (ICIIBMS), 2015 International Conference on
  • Type
    conf
  • DOI
    10.1109/ICIIBMS.2015.7439495
  • Filename
    7439495