DocumentCode :
928809
Title :
A convergent gambling estimate of the entropy of English
Author :
Cover, Thomas M. ; King, Roger C.
Volume :
24
Issue :
4
fYear :
1978
fDate :
7/1/1978 12:00:00 AM
Firstpage :
413
Lastpage :
421
Abstract :
In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text. The guessing approach precludes asymptotic consistency of either the upper or lower bounds except for degenerate ergodic processes. Shannon's technique of guessing the next symbol is altered by having the subject place sequential bets on the next symbol of text. If $S_n$ denotes the subject's capital after $n$ bets at 27-for-1 odds, and if it is assumed that the subject knows the underlying probability distribution for the process $X$, then the entropy estimate is $\hat{H}_n(X) = (1 - (1/n)\log_{27} S_n)\log_2 27$ bits/symbol. If the subject does not know the true probability distribution for the stochastic process, then $\hat{H}_n(X)$ is an asymptotic upper bound for the true entropy. If $X$ is stationary, $E\hat{H}_n(X) \rightarrow H(X)$, $H(X)$ being the true entropy of the process. Moreover, if $X$ is ergodic, then by the Shannon-McMillan-Breiman theorem $\hat{H}_n(X) \rightarrow H(X)$ with probability one. Preliminary indications are that English text has an entropy of approximately 1.3 bits/symbol, which agrees well with Shannon's estimate.
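The abstract's estimator can be illustrated with a minimal sketch: an idealized gambler who knows the source distribution bets proportionally on each of the 27 symbols at 27-for-1 odds, and the estimate $\hat{H}_n(X) = (1 - (1/n)\log_{27} S_n)\log_2 27$ is computed from the resulting capital. The i.i.d. source and its skewed distribution below are assumptions for demonstration only, not part of the paper, which uses human subjects betting on real English text.

```python
import math
import random

M = 27  # alphabet size: 26 letters plus space

def entropy_estimate(log27_S, n):
    # H_hat_n(X) = (1 - (1/n) log_27 S_n) * log_2 27  bits/symbol
    return (1.0 - log27_S / n) * math.log2(M)

def true_entropy(p):
    # Shannon entropy in bits/symbol of an i.i.d. source with distribution p
    return -sum(q * math.log2(q) for q in p if q > 0)

random.seed(1)
# hypothetical skewed distribution over the 27 symbols (illustration only)
weights = [1.0 / (k + 1) for k in range(M)]
p = [w / sum(weights) for w in weights]

n = 50000
log27_S = 0.0  # log base 27 of the capital S_n, starting from S_0 = 1
for _ in range(n):
    x = random.choices(range(M), weights=p)[0]
    # the proportional bettor stakes fraction p[s] of capital on each symbol s;
    # the winning stake pays 27-for-1, so capital multiplies by M * p[x]
    log27_S += math.log(M * p[x], M)

print(f"gambling estimate: {entropy_estimate(log27_S, n):.3f} bits/symbol")
print(f"true entropy:      {true_entropy(p):.3f} bits/symbol")
```

For this proportional bettor, the expected log-capital growth per bet equals $\log_2 27 - H(X)$ bits, so the estimate converges to the true entropy as $n$ grows, matching the abstract's consistency claim.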
Keywords :
Bibliographies; Entropy functions; Languages; Character generation; Entropy; Helium; Laboratories; Random processes; Random variables; Statistics; Stochastic processes; Upper bound; Yield estimation;
fLanguage :
English
Journal_Title :
Information Theory, IEEE Transactions on
Publisher :
ieee
ISSN :
0018-9448
Type :
jour
DOI :
10.1109/TIT.1978.1055912
Filename :
1055912