DocumentCode :
3431649
Title :
Symbol ranking text compressors
Author :
Fenwick, Peter
Author_Institution :
Dept. of Comput. Sci., Auckland Univ., New Zealand
fYear :
1997
fDate :
25-27 Mar 1997
Firstpage :
436
Abstract :
Summary form only given. In 1951 Shannon estimated the entropy of English text by giving human subjects a sample of text and asking them to guess the next letters. He found, in one example, that 79% of the attempts were correct at the first try, 8% needed two attempts and 3% needed three attempts. By regarding the number of attempts as an information source, he could estimate the language entropy. Shannon also stated that an “identical twin” to the original predictor could recover the original text, and these ideas are developed here to provide a new taxonomy of text compressors. In all cases these compressors recode the input into “rankings” of “most probable symbol”, “next most probable symbol”, and so on. The rankings have a highly skewed distribution (low entropy) and are processed by a conventional statistical compressor. Several “symbol ranking” compressors have appeared in the literature, though seldom under that name or even with reference to Shannon's work. The author has developed a compressor which uses constant-order contexts and is based on a set-associative cache with LRU update. A software implementation has run at about 1 Mbyte/s with an average compression of 3.6 bits/byte on the Calgary Corpus.
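The recoding step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: it replaces the set-associative cache with a plain dictionary that holds one most-recently-used list per context, and the context order of 2, the function names, and the initial list ordering are all assumptions made for illustration. It shows only the general idea: symbols become ranks, an "identical twin" decoder recovers the text, and the ranks have low entropy.

```python
import math
from collections import Counter, defaultdict

ORDER = 2  # assumed context length; the abstract says only "constant-order"


def _new_tables():
    # One recency list per context, initially bytes 0..255 in numeric order
    # (an assumption; any fixed order shared by both sides would do).
    return defaultdict(lambda: list(range(256)))


def encode(data: bytes) -> list[int]:
    """Recode each byte as its rank in the recency list of its context."""
    tables = _new_tables()
    ranks = []
    for i, sym in enumerate(data):
        ctx = data[max(0, i - ORDER):i]
        lru = tables[ctx]
        r = lru.index(sym)         # rank 0 = most recently seen in this context
        ranks.append(r)
        lru.insert(0, lru.pop(r))  # LRU-style update: move symbol to the front
    return ranks


def decode(ranks: list[int]) -> bytes:
    """The 'identical twin': replays the same updates to recover the text."""
    tables = _new_tables()
    out = bytearray()
    for r in ranks:
        ctx = bytes(out[max(0, len(out) - ORDER):])
        lru = tables[ctx]
        sym = lru.pop(r)
        lru.insert(0, sym)
        out.append(sym)
    return bytes(out)


def rank_entropy(ranks: list[int]) -> float:
    """Zeroth-order entropy of the rank stream, in bits per symbol."""
    n = len(ranks)
    counts = Counter(ranks)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    text = b"the cat sat on the mat; the cat sat on the mat; " * 20
    ranks = encode(text)
    print("round trip ok:", decode(ranks) == text)
    print("rank entropy (bits/symbol):", round(rank_entropy(ranks), 3))
```

In a complete compressor the rank stream would then be passed to a conventional statistical coder, as the abstract notes; that stage is omitted here.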
Keywords :
data compression; entropy; statistical analysis; word processing; 1 Mbyte/s; Calgary Corpus; English text; LRU update; Shannon; constant-order contexts; identical twin; language entropy; original predictor; set-associative cache; skew distribution; software implementation; statistical compressor; symbol ranking text compressors; Compressors; Computer science; Data compression; Digital systems; Entropy; Humans; Taxonomy; World Wide Web;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 1997. DCC '97. Proceedings
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-8186-7761-9
Type :
conf
DOI :
10.1109/DCC.1997.582093
Filename :
582093