Title :
On Finite Memory Universal Data Compression and Classification of Individual Sequences
Author_Institution :
Technion - Israel Inst. of Technol., Haifa
fDate :
4/1/2008 12:00:00 AM
Abstract :
Consider the case where consecutive blocks of letters of a semi-infinite individual sequence over a finite-alphabet are being compressed into binary sequences by some one-to-one mapping. No a priori information about is available at the encoder, which must therefore adopt a universal data-compression algorithm. It is known that if the universal Lempel-Ziv (LZ) data compression algorithm is successively applied to -blocks then the best error-free compression, for the particular individual sequence is achieved as tends to infinity. The best possible compression that may be achieved by any universal data compression algorithm for finite -blocks is discussed. It is demonstrated that context tree coding essentially achieves it. Next, consider a device called classifier (or discriminator) that observes an individual training sequence . The classifier´s task is to examine individual test sequences of length and decide whether the test -sequence has the same features as those that are captured by the training sequence , or is sufficiently different, according to some appropriate criterion. Here again, it is demonstrated that a particular universal context classifier with a storage-space complexity that is linear in , is essentially optimal. This may contribute a theoretical ldquoindividual sequencerdquo justification for the Probabilistic Suffix Tree (PST) approach in learning theory and in computational biology.
Keywords :
binary sequences; data compression; tree codes; binary sequences; context tree coding; discriminator; finite memory universal data classification; finite memory universal data compression; individual sequence justification; individual training sequence; probabilistic suffix tree approach; storage-space complexity; universal Lempel-Ziv data compression; universal context classifier; Binary sequences; Biological system modeling; Compression algorithms; Computational biology; Data compression; H infinity control; Information analysis; Jacobian matrices; Testing; Turing machines; Context-tree coding; data compression; universal classification; universal compression;
Journal_Title :
Information Theory, IEEE Transactions on
DOI :
10.1109/TIT.2008.917666