Title :
Symbol ranking text compressors
Author_Institution :
Dept. of Comput. Sci., Auckland Univ., New Zealand
Abstract :
Summary form only given. In 1951 Shannon estimated the entropy of English text by giving human subjects a sample of text and asking them to guess the next letters. He found, in one example, that 79% of the attempts were correct at the first try, 8% needed two attempts, and 3% needed three attempts. By regarding the number of attempts as an information source, he could estimate the entropy of the language. Shannon also stated that an “identical twin” of the original predictor could recover the original text, and these ideas are developed here to provide a new taxonomy of text compressors. In all cases these compressors recode the input into “rankings” of “most probable symbol”, “next most probable symbol”, and so on. The rankings have a very skewed distribution (low entropy) and are processed by a conventional statistical compressor. Several “symbol ranking” compressors have appeared in the literature, though seldom under that name or with reference to Shannon’s work. The author has developed a compressor that uses constant-order contexts and is based on a set-associative cache with LRU update. A software implementation has run at about 1 Mbyte/s with an average compression of 3.6 bits/byte on the Calgary Corpus.
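The ranking transform the abstract describes can be illustrated with a minimal sketch. This is a hypothetical Python illustration, not the paper's implementation: the set-associative cache is simplified to an unbounded per-context LRU list, and the ORDER constant and the (rank, literal) token scheme are assumptions made for the example. Each fixed-order context keeps its symbols in most-recently-used order; the coder emits each symbol's rank within its context's list, producing the skewed low-entropy stream that a conventional statistical coder would then compress.

    from collections import defaultdict

    ORDER = 3  # context length in bytes (assumed value; the paper uses constant-order contexts)

    def rank_encode(data: bytes):
        """Recode bytes into ("rank", r) tokens, or ("literal", sym) on a context miss."""
        contexts = defaultdict(list)          # context -> symbol list, most recent first
        out = []
        for i, sym in enumerate(data):
            ctx = data[max(0, i - ORDER):i]
            lru = contexts[ctx]
            if sym in lru:
                r = lru.index(sym)            # rank 0 = most probable symbol in this context
                lru.pop(r)
                out.append(("rank", r))
            else:
                out.append(("literal", sym))  # symbol not yet seen in this context
            lru.insert(0, sym)                # LRU update: move symbol to the front
        return out

    def rank_decode(tokens):
        """Invert rank_encode; works because contexts depend only on already-decoded bytes."""
        contexts = defaultdict(list)
        data = bytearray()
        for kind, val in tokens:
            ctx = bytes(data[max(0, len(data) - ORDER):])
            lru = contexts[ctx]
            sym = lru.pop(val) if kind == "rank" else val
            data.append(sym)
            lru.insert(0, sym)
        return bytes(data)

    # Round trip: repeated contexts yield mostly rank-0 tokens.
    tokens = rank_encode(b"the theme the")
    assert rank_decode(tokens) == b"the theme the"

The decoder is the “identical twin” of the encoder's predictor: because both sides maintain the same per-context LRU state from the same history, the ranks alone suffice to recover the text.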
Keywords :
data compression; entropy; statistical analysis; word processing; 1 Mbyte/s; Calgary Corpus; English text; LRU update; Shannon; constant-order contexts; identical twin; language entropy; original predictor; set-associative cache; skew distribution; software implementation; statistical compressor; symbol ranking text compressors; Compressors; Computer science; Data compression; Digital systems; Entropy; Humans; Taxonomy; World Wide Web;
Conference_Titel :
Data Compression Conference, 1997. DCC '97. Proceedings
Conference_Location :
Snowbird, UT
Print_ISBN :
0-8186-7761-9
DOI :
10.1109/DCC.1997.582093