DocumentCode
404245
Title
Implementation of gradient estimation to a constrained Markov decision problem
Author
Krishnamurthy, Vikram ; Martin, Katerine ; Vázquez-Abad, Felisa
Volume
5
fYear
2003
fDate
9-12 Dec. 2003
Firstpage
4841
Abstract
We consider a constrained Markov Decision Process (MDP). Under a parameterization of the control strategies, the problem can be transformed into a nonlinear optimization problem with nonlinear constraints, where both the cost and the constraints are stationary averages. We assume that the transition probabilities of the underlying Markov chain are unknown: only the values of the control variables and the instantaneous values of the cost and the constraints are observed, so no analytical expression for the stationary averages is available. To solve the optimization problem, we use a stochastic version of a primal/dual method with an augmented Lagrangian. The updating scheme uses a "measure-valued" estimator of the gradients that can be interpreted in terms of a finite-horizon version of the Perturbation Analysis (PA) method known as "perturbation realization factors". Most finite-horizon derivative estimators are consistent as the sample size grows, so it is common to assume that samples large enough to make the bias negligible can be observed. This paper deals with the actual implementation of the gradient estimators over a finite horizon with small sample sizes, so that the stochastic approximation iterates can be performed very often, as required for on-line learning. We identify the asymptotic bias of the stochastic approximation for the constrained optimization method and, in doing so, propose several means to correct it. As is common with such problems, the bias correction introduces a conflict between precision and speed: the smaller the bias, the slower the reaction time. In the sequel, we present the theoretical basis for the study of bias and learning rate. Our experimental results indicate that smoothing may be necessary only at the slower time scale, not at the faster one. We include results in which the algorithms must track changes in the environment.
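Illustrative Sketch
The abstract describes the optimization machinery in words only. As a rough illustration of the update shape, the following minimal Python sketch runs a stochastic primal/dual iteration on an augmented Lagrangian for a toy constrained problem. Everything in it is an assumption for illustration: the toy objective and constraint, the hypothetical noisy oracles (noisy_cost_grad, noisy_constraint, noisy_constraint_grad), the step size eps, and the penalty c. It is not the paper's measure-valued estimator, which builds gradient estimates from finite-horizon trajectories of the MDP via perturbation realization factors.

import numpy as np

# Toy problem: minimize ||theta||^2 subject to g(theta) = 1 - sum(theta) <= 0.
# The KKT point is theta = (1/3, 1/3, 1/3) with multiplier lam = 2/3.

def noisy_cost_grad(theta, rng):
    # Gradient of ||theta||^2, observed with additive noise.
    return 2.0 * theta + 0.1 * rng.standard_normal(theta.shape)

def noisy_constraint(theta, rng):
    # g(theta) = 1 - sum(theta), observed with additive noise.
    return 1.0 - theta.sum() + 0.1 * rng.standard_normal()

def noisy_constraint_grad(theta, rng):
    return -np.ones_like(theta) + 0.1 * rng.standard_normal(theta.shape)

def primal_dual_step(theta, lam, rng, eps=0.01, c=5.0):
    """One update on L(theta, lam) = f + lam * g + (c / 2) * max(g, 0)**2."""
    g = noisy_constraint(theta, rng)
    grad = noisy_cost_grad(theta, rng) \
        + (lam + c * max(g, 0.0)) * noisy_constraint_grad(theta, rng)
    theta = theta - eps * grad          # primal descent step
    lam = max(0.0, lam + eps * g)       # projected dual ascent step
    return theta, lam

rng = np.random.default_rng(0)
theta, lam = np.zeros(3), 0.0
for _ in range(20000):
    theta, lam = primal_dual_step(theta, lam, rng)
# With a constant step size the iterates hover near the KKT point,
# up to noise of order eps.
print(theta, lam)

In the paper's two-time-scale setting, the dual (multiplier) update would typically run on the slower time scale; here both updates share the same eps for brevity.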
Keywords
Markov processes; approximation theory; decision theory; gradient methods; optimisation; perturbation techniques; Markov chain; augmented Lagrangian; bias correction; constrained Markov decision process; finite horizon derivative estimators; gradient estimation; nonlinear constraints; nonlinear optimization method; online learning; perturbation analysis method; perturbation realization factors; primal/dual method; stochastic approximation; Constraint optimization; Convergence; Costs; Electric variables control; Lagrangian functions; Optimal control; Optimization methods; Smoothing methods; State-space methods; Stochastic processes
fLanguage
English
Publisher
IEEE
Conference_Title
Proceedings of the 42nd IEEE Conference on Decision and Control, 2003
ISSN
0191-2216
Print_ISBN
0-7803-7924-1
Type
conf
DOI
10.1109/CDC.2003.1272362
Filename
1272362
Link To Document