Title :
Context models in the MDL framework
Author :
Ristad, E.S. ; Thomas, R.G., III
Author_Institution :
Dept. of Comput. Sci., Princeton Univ., NJ
Abstract :
Current approaches to speech and handwriting recognition demand a strong language model with a small number of states and an even smaller number of parameters. We introduce four new techniques for statistical language models: multicontextual modeling, nonmonotonic contexts, implicit context growth, and the divergence heuristic. Together these techniques result in language models that have few states, even fewer parameters, and low message entropies. For example, our techniques achieve a message entropy of 2.16 bits/char on the Brown corpus using only 19374 contexts and 54621 parameters. Multicontextual modeling and nonmonotonic contexts are generalizations of the traditional context model. Implicit context growth ensures that the state transition probabilities of a variable-length Markov process are estimated accurately. This technique is generally applicable to any variable-length Markov process whose state transition probabilities are estimated from string frequencies. In our case, each state in the Markov process represents a context, and implicit context growth conditions the shorter contexts on the fact that the longer contexts did not occur. In a traditional unicontext model, this technique reduces the message entropy of typical English text by 0.1 bits/char. The divergence heuristic is a heuristic estimation algorithm based on Rissanen's (1978, 1983) minimum description length (MDL) principle and universal data compression algorithm.
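To make the abstract's setting concrete, the following is a minimal Python sketch of a variable-length context model that predicts each character from its longest previously seen context and reports message entropy in bits/char. It is an illustrative simplification only: the class name, Laplace smoothing, and byte-uniform fallback are assumptions of this sketch, and the backoff here stands in for, but does not implement, the paper's multicontextual modeling, implicit context growth, or divergence heuristic.

```python
from collections import defaultdict
import math

class ContextModel:
    """Sketch of a variable-length (backoff) context model.

    Each character is predicted from the longest matching context of
    length <= max_order, backing off to shorter contexts when the
    longer ones have never been observed.
    """

    def __init__(self, max_order=3):
        self.max_order = max_order
        # counts[context_string][char] = observed frequency
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, context, char):
        # Back off from the longest candidate context to the empty one;
        # use Laplace-smoothed counts at the first context with data.
        for k in range(min(len(context), self.max_order), -1, -1):
            ctx = context[len(context) - k:]
            table = self.counts.get(ctx)
            if table:
                total = sum(table.values())
                vocab = len(table) + 1  # reserve mass for unseen chars
                return (table.get(char, 0) + 1) / (total + vocab)
        return 1.0 / 256  # nothing observed yet: uniform over bytes

    def update(self, context, char):
        # Record the character under every context length up to
        # max_order, so longer contexts accumulate alongside shorter ones.
        for k in range(min(len(context), self.max_order) + 1):
            ctx = context[len(context) - k:]
            self.counts[ctx][char] += 1

    def message_entropy(self, text):
        # Average sequential code length, -log2 p(char | context),
        # in bits per character, updating the model after each symbol.
        bits = 0.0
        for i, ch in enumerate(text):
            bits += -math.log2(self.prob(text[:i], ch))
            self.update(text[:i], ch)
        return bits / len(text)
```

On repetitive text the per-character code length falls well below the 8 bits/char of the uniform byte model, which is the same effect the paper quantifies (2.16 bits/char on the Brown corpus) with its far more refined techniques.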
Keywords :
Markov processes; context-sensitive languages; data compression; entropy codes; handwriting recognition; heuristic programming; probability; speech coding; speech recognition; statistical analysis; Brown corpus; English text; code length; divergence heuristic; handwriting recognition; heuristic estimation algorithm; implicit context growth; low message entropies; minimum description length; multicontextual modeling; nonmonotonic contexts; speech recognition; state transition probabilities; statistical language models; string frequencies; universal data compression algorithm; variable-length Markov process; Context modeling; Data compression; Entropy; Frequency estimation; Handwriting recognition; Heuristic algorithms; Markov processes; Natural languages; Speech; State estimation;
Conference_Title :
Data Compression Conference, 1995. DCC '95. Proceedings
Conference_Location :
Snowbird, UT
Print_ISBN :
0-8186-7012-6
DOI :
10.1109/DCC.1995.515496