DocumentCode :
918841
Title :
On the MDL principle for i.i.d. sources with large alphabets
Author :
Shamir, Gil I.
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Utah, Salt Lake City, UT
Volume :
52
Issue :
5
fYear :
2006
fDate :
5/1/2006 12:00:00 AM
Firstpage :
1939
Lastpage :
1955
Abstract :
Average case universal compression of independent and identically distributed (i.i.d.) sources is investigated, where the source alphabet is large, and may be sublinear in size or even larger than the compressed data sequence length n. In particular, the well-known results, including Rissanen's strongest-sense lower bound, for fixed-size alphabets are extended to the case where the alphabet size k is allowed to grow with n. It is shown that as long as k = o(n), instead of the coding cost in the fixed-size alphabet case of 0.5 log n extra code bits for each one of the k-1 unknown probability parameters, the cost is now 0.5 log(n/k) code bits for each unknown parameter. This result is shown to be the lower bound in the minimax and maximin senses, as well as for almost every source in the class. Achievability of this bound is demonstrated with two-part codes based on quantization of the maximum-likelihood (ML) probability parameters, as well as by using the well-known Krichevsky-Trofimov (KT) low-complexity sequential probability estimates. For very large alphabets, k ≫ n, it is shown that an average minimax and maximin bound on the redundancy is essentially (to first order) log(k/n) bits per symbol. This bound is shown to be achievable both with two-part codes and with a sequential modification of the KT estimates. For k = Θ(n), the redundancy is Θ(1) bits per symbol. Finally, sequential codes are designed for coding sequences in which only m < min{k, n} alphabet symbols occur.
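The abstract's central claim can be illustrated numerically: the standard KT (add-1/2) sequential estimator, applied to an i.i.d. sequence over a large alphabet with k = o(n), incurs a per-parameter redundancy close to 0.5 log2(n/k) bits rather than the fixed-alphabet 0.5 log2(n). The sketch below is my own illustration, not code from the paper; the function names and the uniform-source test case are assumptions.

```python
import math
import random

def kt_code_length(seq, k):
    """Code length in bits assigned by the sequential Krichevsky-Trofimov
    (add-1/2) probability estimates over an alphabet of size k."""
    counts = [0] * k
    bits = 0.0
    for t, sym in enumerate(seq):
        p = (counts[sym] + 0.5) / (t + k / 2)  # KT predictive probability
        bits += -math.log2(p)
        counts[sym] += 1
    return bits

def ml_code_length(seq):
    """Ideal (unachievable) code length -log2 P_ML(seq): the sequence
    length times its empirical entropy."""
    n = len(seq)
    counts = {}
    for sym in seq:
        counts[sym] = counts.get(sym, 0) + 1
    return -sum(c * math.log2(c / n) for c in counts.values())

random.seed(0)
n, k = 10000, 100                      # k = o(n): the paper's moderate large-alphabet regime
seq = [random.randrange(k) for _ in range(n)]  # i.i.d. uniform source (an assumption for the demo)

redundancy = kt_code_length(seq, k) - ml_code_length(seq)
per_param = redundancy / (k - 1)       # redundancy per unknown probability parameter
predicted = 0.5 * math.log2(n / k)     # the paper's first-order per-parameter cost
```

With these values, `per_param` lands near `predicted` (about 0.5·log2(100) ≈ 3.3 bits) and well below the fixed-alphabet cost 0.5·log2(n) ≈ 6.6 bits per parameter, consistent with the 0.5 log(n/k) result stated above.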
Keywords :
data compression; maximum likelihood estimation; probability; sequential codes; KT estimation; Krichevsky-Trofimov low-complexity probability; MDL principle; data compression; iid; independent-identical distributed source; maximum-likelihood probability parameter; minimum description length; sequential code; Cities and towns; Communication system control; Costs; Entropy; Gas insulated transmission lines; Helium; Information theory; Maximum likelihood estimation; Minimax techniques; Quantization; Independent and identically distributed (i.i.d.) sources; maximin redundancy; minimax redundancy; minimum description length (MDL); redundancy; redundancy for most sources; redundancy–capacity theorem; sequential codes; universal coding;
fLanguage :
English
Journal_Title :
Information Theory, IEEE Transactions on
Publisher :
ieee
ISSN :
0018-9448
Type :
jour
DOI :
10.1109/TIT.2006.872846
Filename :
1624633