Distributed Policy Evaluation Under Multiple Behavior Strategies

Author

Valcarcel Macua, Sergio; Chen, Jianshu; Zazo, Santiago; Sayed, Ali H.

Author_Institution

Dept. of Signals, Systems and Radiocommunications, Universidad Politécnica de Madrid, Madrid, Spain

Volume

60

Issue

5

fYear

2015

fDate

May 2015

Firstpage

1260

Lastpage

1274

Abstract

We apply diffusion strategies to develop a fully-distributed cooperative reinforcement learning algorithm in which agents in a network communicate only with their immediate neighbors to improve predictions about their environment. The algorithm can also be applied to off-policy learning, meaning that the agents can predict the response to a behavior different from the actual policies they are following. The proposed distributed strategy is efficient, with linear complexity in both computation time and memory footprint. We provide a mean-square-error performance analysis and establish convergence under constant step-size updates, which endow the network with continuous learning capabilities. The results show a clear gain from cooperation: when the individual agents can estimate the solution, cooperation increases stability and reduces bias and variance of the prediction error; but, more importantly, the network is able to approach the optimal solution even when none of the individual agents can (e.g., when the individual behavior policies restrict each agent to sample a small portion of the state space).

Keywords

computational complexity; learning (artificial intelligence); mean square error methods; computation time; continuous learning capabilities; distributed policy evaluation; fully-distributed cooperative reinforcement learning algorithm; linear complexity; mean-square-error performance analysis; memory footprint; off-policy learning; Approximation algorithms; Equations; Linear approximation; Markov processes; Prediction algorithms; Vectors; Adaptive networks; Arrow-Hurwicz algorithm; diffusion strategies; distributed processing; gradient temporal difference; mean-square-error; reinforcement learning; saddle-point problem

fLanguage

English

Journal_Title

IEEE Transactions on Automatic Control

Publisher

IEEE

ISSN

0018-9286

Type

jour

DOI

10.1109/TAC.2014.2368731

Filename

6949624