Title :
Symbol ranking text compressors
Author_Institution :
Dept. of Comput. Sci., Auckland Univ., New Zealand
Abstract :
Summary form only given. In 1951 Shannon estimated the entropy of English text by giving human subjects a sample of text and asking them to guess the next letters. He found, in one example, that 79% of the attempts were correct at the first try, 8% needed two attempts, and 3% needed three attempts. By regarding the number of attempts as an information source, he could estimate the entropy of the language. Shannon also stated that an “identical twin” of the original predictor could recover the original text, and these ideas are developed here to provide a new taxonomy of text compressors. In all cases these compressors recode the input into “rankings” of “most probable symbol”, “next most probable symbol”, and so on. The rankings have a very skewed distribution (low entropy) and are processed by a conventional statistical compressor. Several “symbol ranking” compressors have appeared in the literature, though seldom under that name or with reference to Shannon’s work. The author has developed a compressor that uses constant-order contexts and is based on a set-associative cache with LRU update. A software implementation has run at about 1 Mbyte/s with an average compression of 3.6 bits/byte on the Calgary Corpus.
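The ranking transform the abstract describes can be illustrated with a minimal sketch. This is a hypothetical Python illustration, not the paper's implementation: the set-associative cache is simplified to an unbounded per-context LRU list, and the ORDER constant and the (rank, literal) token scheme are assumptions made for the example. Each fixed-order context keeps its symbols in most-recently-used order; the coder emits each symbol's rank within its context's list, producing the skewed low-entropy stream that a conventional statistical coder would then compress.

    from collections import defaultdict

    ORDER = 3  # context length in bytes (assumed value; the paper uses constant-order contexts)

    def rank_encode(data: bytes):
        """Recode bytes into ("rank", r) tokens, or ("literal", sym) on a context miss."""
        contexts = defaultdict(list)          # context -> symbol list, most recent first
        out = []
        for i, sym in enumerate(data):
            ctx = data[max(0, i - ORDER):i]
            lru = contexts[ctx]
            if sym in lru:
                r = lru.index(sym)            # rank 0 = most probable symbol in this context
                lru.pop(r)
                out.append(("rank", r))
            else:
                out.append(("literal", sym))  # symbol not yet seen in this context
            lru.insert(0, sym)                # LRU update: move symbol to the front
        return out

    def rank_decode(tokens):
        """Invert rank_encode; works because contexts depend only on already-decoded bytes."""
        contexts = defaultdict(list)
        data = bytearray()
        for kind, val in tokens:
            ctx = bytes(data[max(0, len(data) - ORDER):])
            lru = contexts[ctx]
            sym = lru.pop(val) if kind == "rank" else val
            data.append(sym)
            lru.insert(0, sym)
        return bytes(data)

    # Round trip: repeated contexts yield mostly rank-0 tokens.
    tokens = rank_encode(b"the theme the")
    assert rank_decode(tokens) == b"the theme the"

The decoder is the “identical twin” of the encoder's predictor: because both sides maintain the same per-context LRU state from the same history, the ranks alone suffice to recover the text.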
Keywords :
data compression; entropy; statistical analysis; word processing; 1 Mbyte/s; Calgary Corpus; English text; LRU update; Shannon; constant-order contexts; identical twin; language entropy; original predictor; set-associative cache; skew distribution; software implementation; statistical compressor; symbol ranking text compressors; Compressors; Computer science; Data compression; Digital systems; Entropy; Humans; Taxonomy; World Wide Web;
Conference_Titel :
Data Compression Conference, 1997. DCC '97. Proceedings
Conference_Location :
Snowbird, UT
Print_ISBN :
0-8186-7761-9
DOI :
10.1109/DCC.1997.582093