Title :
A sample path theory for time-average Markov decision processes
Author :
Ross, K.W. ; Varadarajan, R.
Author_Institution :
University of Pennsylvania, Philadelphia, PA
Abstract :
Considered are time-average Markov Decision Processes (MDPs) with finite state and action spaces. It is shown that the state space has a natural partition into strongly communicating classes and a set of states which is transient under all stationary policies. For every policy, any associated recurrent class must be a subset of one of the strongly communicating classes; moreover, there exists a stationary policy whose recurrent classes are the strongly communicating classes. A polynomial-time algorithm is given to determine the partition. The decomposition theory is utilized to investigate MDPs with a sample-path constraint. Here, both a cost and a reward are accumulated at each decision epoch. A policy is feasible if the time-average cost is below a specified value with probability one. The optimization problem is to maximize the expected average reward over all feasible policies. For MDPs with arbitrary recurrent structures, it is shown that there exists an ε-optimal stationary policy for each ε > 0 if and only if there exists a feasible policy. Further, verifiable conditions are given for the existence of an optimal stationary policy.
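A minimal sketch of one polynomial-time way to compute a partition of the kind described in the abstract: iteratively build the graph of all positive-probability transitions, take its strongly connected components, delete every action that can leave its state's component, and delete states left with no actions, until nothing changes. This is the "maximal end component" style decomposition familiar from probabilistic model checking, offered here purely as an illustration under assumed data structures; it is not the authors' own algorithm, and all names (`transitions`, `communicating_classes`, the toy MDP) are hypothetical.

```python
# Illustrative sketch, NOT the paper's algorithm: an end-component-style
# fixpoint that partitions a finite MDP into strongly-communicating-class
# candidates plus states transient under every stationary policy.
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Kosaraju's SCC algorithm; edges maps node -> set of successors."""
    order, seen = [], set()
    for root in nodes:                     # first pass: record finish order
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(edges.get(root, ())))]
        while stack:
            node, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(edges.get(w, ()))))
                    break
            else:
                order.append(node)
                stack.pop()
    rev = defaultdict(set)                 # second pass: reversed graph
    for v, succs in edges.items():
        for w in succs:
            rev[w].add(v)
    comps, assigned = [], set()
    for v in reversed(order):
        if v in assigned:
            continue
        comp, stack = set(), [v]
        assigned.add(v)
        while stack:
            x = stack.pop()
            comp.add(x)
            for w in rev[x]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def communicating_classes(transitions):
    """transitions[s][a] = set of states reachable with positive probability.
    Returns (classes, transient_states)."""
    allowed = {s: set(acts) for s, acts in transitions.items()}
    states = set(transitions)
    comps, changed = [], True
    while changed:
        changed = False
        edges = {s: set() for s in states}
        for s in states:
            for a in allowed[s]:
                edges[s] |= transitions[s][a] & states
        comps = strongly_connected_components(states, edges)
        comp_of = {s: i for i, comp in enumerate(comps) for s in comp}
        for s in list(states):
            for a in list(allowed[s]):
                # drop any action that may exit s's component
                if any(comp_of.get(t) != comp_of[s] for t in transitions[s][a]):
                    allowed[s].remove(a)
                    changed = True
            if not allowed[s]:             # no action keeps s inside a class
                states.remove(s)
                changed = True
    transient = set(transitions) - states
    return comps, transient

# Toy 4-state MDP: {0, 1} and {3} are the classes; 2 is transient
# (its only action leads away and never returns).
transitions = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"a": {0}},
    2: {"a": {3}},
    3: {"a": {3}},
}
classes, transient = communicating_classes(transitions)
```

On this toy MDP the fixpoint removes state 2 in the first round (its sole action exits 2's singleton component) and then stabilizes, leaving the classes {0, 1} and {3}.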
Keywords :
Constraint theory; Costs; Graph theory; Partitioning algorithms; Polynomials; State-space methods;
Conference_Titel :
26th IEEE Conference on Decision and Control, 1987
Conference_Location :
Los Angeles, California, USA
DOI :
10.1109/CDC.1987.272945