DocumentCode
610101
Title
Compression of Optimal Value Functions for Markov Decision Processes
Author
Kochenderfer, Mykel J. ; Monath, N.
Author_Institution
Lincoln Lab., Massachusetts Inst. of Technol., Cambridge, MA, USA
fYear
2013
fDate
20-22 March 2013
Firstpage
501
Lastpage
501
Abstract
Summary form only given. A Markov decision process (MDP) is defined by a state space, action space, transition model, and reward model. The objective is to maximize the accumulation of reward over time. Solutions can be found through dynamic programming, which generally involves discretization, resulting in significant memory and computational requirements. Although computer clusters can be used to solve large problems, many applications require that solutions be executed on less capable hardware. We explored a general method for compressing solutions in a way that preserves fast random-access lookups. The method was applied to an MDP for an aircraft collision avoidance system. In our problem, the state space S consists of aircraft positions and velocities and the action space A consists of resolution advisories provided by the collision avoidance system, with |S| > 1.5 x 10^6 and |A| = 10. The solution to an MDP can be represented by an |S| x |A| matrix specifying Q*(s,a), the expected return of the optimal strategy from s after executing action a. Since, on average, only 6.6 actions are available from each state in our problem, it is more efficient to use a sparse representation consisting of an array of the permissible values of Q*, organized into variable-length blocks with one block per state. An index provides offsets into this Q* array corresponding to the block boundaries, and an action array lists the actions available from each state. The values of Q* are stored using a 32-bit floating point representation, resulting in 534 MB for the three arrays associated with the sparse representation. Our method first converts to a 16-bit half-precision representation, sorts the state-action values within each block, adjusts the action array appropriately, and then removes redundant blocks. Although LZMA achieves a better compression ratio, it does not support real-time random-access decompression.
The behavior of the proposed method was demonstrated in simulation with negligible impact on safety and operational performance metrics. Although this compression methodology was demonstrated on related MDPs with similar compression ratios, further work will apply the technique to other domains.
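The sparse layout described above (an offset index into a variable-length block of Q* values per state, paired with an action array) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the function and variable names are assumptions, and the within-block sort by value mirrors the preprocessing step the abstract describes before redundant-block removal.

```python
def build_sparse_table(q_values):
    """Build the three arrays of the sparse representation.

    q_values: one dict per state mapping each available action to Q*(s, a).
    Returns (offsets, actions, values), where offsets[s]:offsets[s+1]
    delimits state s's block in the actions and values arrays.
    """
    offsets, actions, values = [0], [], []
    for state_q in q_values:
        # Sort the state-action values within the block and reorder the
        # action array to match, as the method does before deduplication.
        for a, q in sorted(state_q.items(), key=lambda kv: kv[1]):
            actions.append(a)
            values.append(q)
        offsets.append(len(values))
    return offsets, actions, values


def lookup(offsets, actions, values, s, a):
    """Random-access lookup of Q*(s, a); None if a is unavailable in s."""
    for i in range(offsets[s], offsets[s + 1]):
        if actions[i] == a:
            return values[i]
    return None
```

Because each block is delimited by two consecutive offsets, a lookup touches only that state's few entries, which is what preserves fast random access after compression.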
Keywords
Markov processes; aircraft control; collision avoidance; dynamic programming; real-time random access decompression; LZMA; MDP; Markov decision processes; Q* array; action array; action space model; aircraft collision avoidance system; aircraft positions; aircraft velocities; block boundaries; computational requirements; floating point representation; memory requirements; operational performance metrics; optimal strategy; optimal value function compression; random-access lookups; reward model; sparse representation; state space model; transition model; variable-length blocks; word length 16 bit; word length 32 bit; Aircraft; Arrays; Collision avoidance; Computational modeling; Data compression; Dynamic programming; Markov decision process
fLanguage
English
Publisher
ieee
Conference_Titel
Data Compression Conference (DCC), 2013
Conference_Location
Snowbird, UT
ISSN
1068-0314
Print_ISBN
978-1-4673-6037-1
Type
conf
DOI
10.1109/DCC.2013.81
Filename
6543111
Link To Document