Approximate value iteration with randomized policies

Author

De Farias, D.P. ; Van Roy, B.

Author_Institution

Stanford Univ., CA, USA

Volume

4

fYear

2000

fDate

2000

Firstpage

3421

Abstract

The curse of dimensionality in dynamic programming prevents, in most problems of practical interest, the exact computation of the value function. We study the fixed points of approximate value iteration, a simple algorithm that combats the curse of dimensionality by generating approximate iterates of the classical value iteration algorithm in the span of a set of prespecified basis functions. We show that, in general, the modified dynamic programming operator need not possess a fixed point, and therefore, approximate value iteration should not be expected to converge. However, by using a class of randomized policies, approximate value iteration is guaranteed to possess at least one fixed point. We finally discuss the link between approximate value iteration and temporal-difference learning (TD), and show that the existence of fixed points for approximate value iteration implies existence of stationary points for the ordinary differential equation approximated by a version of TD that incorporates “exploration”

Keywords

Markov processes; dynamic programming; function approximation; iterative methods; learning (artificial intelligence); approximate value iteration; modified dynamic programming operator; randomized policies; temporal-difference learning; Control systems; Differential equations; Dynamic programming; State-space methods; Stochastic processes;

fLanguage

English

Publisher

ieee

Conference_Titel

Decision and Control, 2000. Proceedings of the 39th IEEE Conference on

Conference_Location

Sydney, NSW

ISSN

0191-2216

Print_ISBN

0-7803-6638-7

Type

conf

DOI

10.1109/CDC.2000.912232

Filename

912232