Title :
An actor critic algorithm based on Grassmanian search
Author :
Prabuchandran, K.J. ; Bhatnagar, Shalabh ; Borkar, Vivek S.
Author_Institution :
Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
Abstract :
We propose the first online actor-critic scheme with an adaptive basis that finds a locally optimal control policy for a Markov decision process (MDP) under the weighted discounted cost objective. We parameterize both the policy in the actor and the value function in the critic. The actor performs gradient search in the space of policy parameters using simultaneous perturbation stochastic approximation (SPSA) gradient estimates. This gradient computation requires estimates of the value function, which the critic provides by minimizing a mean-square Bellman error objective. To obtain good estimates of the value function, the critic adaptively tunes the basis functions (features) toward the best representation of the value function via gradient search in the Grassmannian of features. Our control algorithm makes use of multi-timescale stochastic approximation: the actor updates its parameters on the slowest timescale, while the critic uses two timescales to estimate the value function. For any given set of features, the critic performs gradient search in its parameter space via a residual gradient scheme on the faster timescale and, on a medium timescale, performs gradient search in the Grassmann manifold of features. We provide an outline of the proof of convergence of our control algorithm to a locally optimal policy. We present empirical results for our algorithm as well as for a similar algorithm that uses temporal difference (TD) learning in place of the residual gradient scheme for the faster timescale updates.
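The following is a minimal sketch, not the authors' code, of the three-timescale structure the abstract describes: a fast residual-gradient update of the critic weights, a medium-timescale gradient step on the feature matrix followed by a QR retraction back onto the Grassmannian, and a slow two-sided SPSA update of the policy parameters. The toy random MDP, step-size schedules, uniform start-state weighting, and the exact evaluation of the perturbed-policy cost (standing in for the critic's online estimates) are all illustrative assumptions.

```python
# Sketch of the three-timescale actor-critic with Grassmannian feature search
# on a randomly generated toy MDP (all constants and the MDP are assumptions).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, k, gamma = 10, 2, 3, 0.9                     # states, actions, feature rank, discount
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # P[s, a, s']: transition kernel
C = rng.random((nS, nA))                             # C[s, a]: one-stage cost
theta = np.zeros((nS, nA))                           # actor: softmax policy parameters
Phi = np.linalg.qr(rng.standard_normal((nS, k)))[0]  # critic: orthonormal feature matrix
w = np.zeros(k)                                      # critic: value-function weights

def policy(th):
    e = np.exp(th - th.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def induced(th):
    """Transition matrix and one-stage cost induced by the softmax policy."""
    pi = policy(th)
    return np.einsum('sa,sab->sb', pi, P), np.einsum('sa,sa->s', pi, C)

def discounted_cost(th):
    """Exact weighted discounted cost (uniform start-state weights) on the toy MDP."""
    Ppi, cpi = induced(th)
    return np.linalg.solve(np.eye(nS) - gamma * Ppi, cpi).mean()

eps = 0.05                                           # SPSA perturbation size
for n in range(2000):
    a, b, c = 1.0 / (n + 10), 0.1 / (n + 10), 0.01 / (n + 10)  # fast > medium > slow steps

    Ppi, cpi = induced(theta)
    d = cpi + gamma * Ppi @ (Phi @ w) - Phi @ w      # Bellman residual of V = Phi @ w

    # Fast timescale: residual-gradient step on the critic weights
    # (gradient of the mean-square Bellman error 0.5 * ||d||^2 with respect to w).
    w -= a * (gamma * Ppi @ Phi - Phi).T @ d

    # Medium timescale: gradient step on the feature matrix, then a QR
    # re-orthonormalization as a simple retraction back onto the Grassmannian.
    grad_Phi = np.outer(gamma * Ppi.T @ d - d, w)
    Phi = np.linalg.qr(Phi - b * grad_Phi)[0]

    # Slow timescale: two-sided SPSA estimate of the policy gradient from two
    # perturbed evaluations of the discounted-cost objective.
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    g = (discounted_cost(theta + eps * delta)
         - discounted_cost(theta - eps * delta)) / (2 * eps * delta)
    theta -= c * g

print("final weighted discounted cost:", discounted_cost(theta))
```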
Keywords :
Markov processes; gradient methods; learning systems; mean square error methods; optimal control; perturbation techniques; search problems; stochastic systems; Grassman manifold; Grassmanian search; MDP; Markov decision process; SPSA gradient estimate; TD learning; actor critic algorithm; control algorithm; gradient computation; gradient search; local optimal control policy; mean square Bellman error objective; multitimescale stochastic approximation; online actor-critic scheme; policy parameter; residual gradient scheme; simultaneous perturbation stochastic approximation gradient estimate; temporal difference learning; value function; weighted discounted cost objective; Approximation algorithms; Convergence; Function approximation; Linear approximation; Linear programming; Vectors; Control; Grassman manifold; feature adaptation; online learning; residual gradient scheme; stochastic approximation; temporal difference learning;
Conference_Title :
2014 IEEE 53rd Annual Conference on Decision and Control (CDC)
Conference_Location :
Los Angeles, CA
Print_ISBN :
978-1-4799-7746-8
DOI :
10.1109/CDC.2014.7039948