Authors:
Åberg, J. ; Shtarkov, Yu.M. ; Smeets, B.J.M.
Abstract :
Some of the best compression ratios for text compression are provided by the PPM (prediction by partial matching) class of algorithms. These algorithms are based on arithmetic coding using a fixed-depth Markov chain model of the source, i.e., the subsequence of symbols generated in any state s of the source is assumed to be the output of a memoryless subsource w=w(s). One of the most crucial steps in any PPM algorithm is the choice of the escape probability, i.e., the probability of a previously unseen symbol appearing in a given state. Let A={1,...,M} be the source alphabet, x^k=x_1 x_2 ... x_k ∈ A^k be a sequence of symbols generated by a subsource w with unknown parameters, and m=m(x^k) be the number of distinct symbols in x^k. In most incarnations of PPM, the expression for the escape probability used for coding is of the form ϑ(E|x^k) = f(k,m)/(k + mα + f(k,m)), where f is a function of the counts and α is a parameter. The encoding of the "escape symbol" E is followed by a description of the new symbol, which is independent of the choice of escape probability and is not discussed. In almost all PPM schemes, the expression for the escape probability has been chosen heuristically. Following the method of Shtarkov et al. (see Problems of Information Transmission, vol. 31, no. 2, pp. 20-35, 1995), we use a universal coding approach to the estimation of the escape probabilities.
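The general escape-probability form in the abstract can be sketched in a few lines. The mapping of the classic PPMC and PPMD heuristics onto particular choices of (f, α) below is an illustration under the stated general form, not a construction taken from the paper:

```python
def escape_probability(k, m, alpha, f):
    """General PPM escape-probability form:
    P(E | x^k) = f(k, m) / (k + m*alpha + f(k, m)),
    where k is the number of symbols seen in the state,
    m the number of distinct symbols, alpha a parameter,
    and f a chosen function of the counts."""
    return f(k, m) / (k + m * alpha + f(k, m))

# Two classic heuristics expressed in this form (illustrative mapping):
# PPMC: f(k, m) = m,     alpha = 0    -> P(E) = m / (k + m)
# PPMD: f(k, m) = m / 2, alpha = -1/2 -> P(E) = m / (2k)
ppmc = escape_probability(10, 3, 0.0, lambda k, m: m)       # 3 / 13
ppmd = escape_probability(10, 3, -0.5, lambda k, m: m / 2)  # 3 / 20 = 0.15
```

The remaining probability mass, 1 − P(E), is then shared among the m symbols already observed in the state, according to their counts.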
Keywords :
Markov processes; arithmetic coding; data compression; document image processing; image coding; image matching; memoryless systems; prediction theory; probability; source coding; word processing; algorithms; compression ratios; encoding; escape probability estimation; escape symbol; fixed-depth Markov chain model; memoryless subsource; prediction by partial matching; source alphabet; symbol subsequences; text compression; universal source coding theory; frequency estimation; information theory; optimization methods; probability distribution; state estimation