DocumentCode :
3265293
Title :
The entropy of English using PPM-based models
Author :
Teahan, W.J.; Cleary, John G.
Author_Institution :
Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
fYear :
1996
fDate :
Mar/Apr 1996
Firstpage :
53
Lastpage :
62
Abstract :
The purpose of this paper is to show that the difference between the best machine models and human models is smaller than might be indicated by previous results. This follows from a number of observations: firstly, the original human experiments used only 27-character English (letters plus space), whereas most computer experiments used the full 128-character ASCII text; secondly, using large amounts of priming text substantially improves PPM's performance; and thirdly, the PPM algorithm can be modified to perform better on English text. The result is machine performance down to 1.46 bits per character. The problem of estimating the entropy of English is discussed. The importance of training text for PPM is demonstrated, showing that its performance can be improved by “adjusting” the alphabet used. The results based on these improvements are then given, with compression down to 1.46 bpc.
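Editor's note: the bits-per-character (bpc) figure quoted above is an empirical cross-entropy estimate, i.e. the model's average negative log-probability per character of a test text. As a minimal sketch (the notation p(c_i | context) for the PPM model's conditional prediction of the i-th character is assumed here, not taken from the record):

H \approx -\frac{1}{N} \sum_{i=1}^{N} \log_2 p(c_i \mid c_{i-k}, \ldots, c_{i-1})

Under this reading, 1.46 bpc means the model assigns each character a probability of roughly 2^{-1.46} \approx 0.36 on average.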
Keywords :
data compression; entropy; speech processing; ASCII text; English; English text; PPM algorithm; PPM based models; alphabet; compression; computer experiments; entropy; human experiments; human models; letters; machine models; machine performance; priming text; training text; Computer science; Context modeling; Cryptography; Entropy; Humans; Natural languages; Optical character recognition software; Speech recognition; Statistics; Upper bound;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 1996. DCC '96. Proceedings
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-8186-7358-3
Type :
conf
DOI :
10.1109/DCC.1996.488310
Filename :
488310