Average cost temporal-difference learning

Author

Tsitsiklis, John N. ; Van Roy, Benjamin

Author_Institution

Lab. for Inf. & Decision Syst., MIT, Cambridge, MA, USA

Volume

1

fYear

1997

fDate

10-12 Dec 1997

Firstpage

498

Abstract

We describe a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations consist of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present results concerning convergence and the limit of convergence. We also provide a bound on the resulting approximation error that exhibits an interesting dependence on the “mixing time” of the Markov chain. The results parallel previous work by the authors (1997) involving approximations of discounted cost-to-go.
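The abstract does not state the update rule, so the following is only a minimal sketch of one common form of average-cost TD(λ) with linear function approximation, not necessarily the exact algorithm analyzed in the paper. The function name `average_cost_td`, the step-size schedule `c / (t0 + t)`, and the two-state example chain are all assumptions made for illustration. The sketch keeps a running estimate `mu` of the average cost and a weight vector `r` for the basis functions, and updates both along a single simulated trajectory.

```python
import random


def average_cost_td(P, g, phi, lam=0.0, n_steps=200_000, c=20.0, t0=100.0, seed=0):
    """Sketch of average-cost TD(lambda) on a finite Markov chain.

    P   : transition matrix as a list of rows (assumed irreducible, aperiodic)
    g   : per-state cost
    phi : per-state feature vectors (the fixed basis functions)
    The step size c / (t0 + t) is an illustrative choice, not taken from the paper.
    """
    rng = random.Random(seed)
    states = range(len(P))
    r = [0.0] * len(phi[0])   # weights of the basis functions
    mu = 0.0                  # running estimate of the average cost
    x = 0
    z = list(phi[x])          # eligibility trace
    for t in range(n_steps):
        x_next = rng.choices(states, weights=P[x])[0]
        gamma = c / (t0 + t)
        v = sum(f * w for f, w in zip(phi[x], r))
        v_next = sum(f * w for f, w in zip(phi[x_next], r))
        # Temporal difference for the differential cost: the average-cost
        # estimate mu is subtracted in place of a discount factor.
        d = g[x] - mu + v_next - v
        r = [w + gamma * d * zi for w, zi in zip(r, z)]
        mu += gamma * (g[x] - mu)
        z = [lam * zi + f for zi, f in zip(z, phi[x_next])]
        x = x_next
    return mu, r


# Hypothetical two-state chain with stationary distribution (5/6, 1/6),
# so the true average cost is 5/6 * 1 + 1/6 * 7 = 2.
P = [[0.9, 0.1], [0.5, 0.5]]
g = [1.0, 7.0]
phi = [[0.0], [1.0]]          # one basis function; constants are not in its span
mu_hat, r_hat = average_cost_td(P, g, phi)
print(mu_hat, r_hat)
```

For this example chain, the Poisson equation gives a differential-cost gap of 10 between state 1 and state 0, so `mu_hat` should settle near 2 and `r_hat[0]` near 10. The single basis function is chosen so that the constant function lies outside its span, since differential costs are only defined up to an additive constant.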

Keywords

Markov processes; convergence; decision theory; dynamic programming; learning (artificial intelligence); approximation error; average cost temporal-difference learning; discounted cost-to-go; fixed basis functions; irreducible aperiodic Markov chain; Cost function; Heuristic algorithms; Iterative algorithms; Laboratories; Poisson equations; State-space methods

fLanguage

English

Publisher

IEEE

Conference_Title

Proceedings of the 36th IEEE Conference on Decision and Control, 1997

Conference_Location

San Diego, CA

ISSN

0191-2216

Print_ISBN

0-7803-4187-2

Type

conf

DOI

10.1109/CDC.1997.650675

Filename

650675