Title :
n-gram estimates in probabilistic models for Pinyin to Hanzi transcription
Author :
Lochovsky, Amelia Fong ; Cheung, Hon-Kit
Author_Institution :
Dept. of Comput. Sci., Hong Kong Univ. of Sci. & Technol., Hong Kong
Abstract :
We consider the problem of sparse data in probabilistic modeling of the Chinese language. To date, n-gram models have outperformed models that try to capture linguistic structure. Various techniques for estimating n-gram statistics for the English language have been proposed and compared, and their relative performance is known to depend on the problem domain in which the probabilistic model is applied. We apply different smoothing techniques to the estimation of bigram statistics in a word-based bigram model for Pinyin-to-Hanzi transcription. Comparative results are reported and show improved accuracy over the MLE method. We have also experimented with hybrid approaches, which use monograms as well as bigrams, and obtained superior results.
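As an illustration of the sparse-data problem the abstract describes, the following minimal sketch contrasts an MLE bigram estimate with two common alternatives. The abstract does not name the smoothing techniques or the hybrid scheme the authors evaluated, so add-one smoothing and linear interpolation with monogram (unigram) frequencies are assumptions chosen for illustration only; the toy corpus, the function names, and the weight `lam` are likewise hypothetical.

```python
from collections import Counter

def train_counts(sentences):
    """Count monograms and bigrams over pre-segmented word sequences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def p_mle(prev, word, unigrams, bigrams):
    """MLE bigram estimate; it is zero for every unseen bigram.
    The monogram count of `prev` approximates the history count."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def p_add_one(prev, word, unigrams, bigrams):
    """Add-one (Laplace) smoothing (assumed technique, not from the paper)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def p_interpolated(prev, word, unigrams, bigrams, lam=0.7):
    """Hybrid estimate mixing bigram MLE with monogram frequency.
    `lam` is a hypothetical weight that would normally be tuned on held-out data."""
    total = sum(unigrams.values())
    p_unigram = unigrams[word] / total
    return lam * p_mle(prev, word, unigrams, bigrams) + (1 - lam) * p_unigram

# Hypothetical toy corpus of segmented Hanzi words.
corpus = [["我", "是", "学生"], ["我", "是", "老师"]]
unigrams, bigrams = train_counts(corpus)

seen = p_mle("是", "学生", unigrams, bigrams)
unseen_mle = p_mle("学生", "老师", unigrams, bigrams)
unseen_add_one = p_add_one("学生", "老师", unigrams, bigrams)
unseen_hybrid = p_interpolated("学生", "老师", unigrams, bigrams)
```

In this sketch the bigram (学生, 老师) never occurs in the corpus, so the MLE estimate assigns it zero probability, while both alternatives reserve some probability mass for it. In a transcription setting, such estimates would be used to rank the candidate Hanzi words that share a Pinyin spelling.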
Keywords :
language translation; natural languages; probability; word processing; Chinese language; English language; Hanzi transcription; MLE method; Pinyin; bigram statistics; hybrid approaches; linguistical structures; monograms; n-gram estimates; probabilistic model; probabilistic modeling; probabilistic models; problem domain; smoothing techniques; sparse data; word based bigram model; Computer science; Equations; Frequency estimation; Information theory; Maximum likelihood estimation; Natural languages; Optical character recognition software; Smoothing methods; Speech recognition; Statistics
Conference_Title :
Intelligent Processing Systems, 1997. ICIPS '97. 1997 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
0-7803-4253-4
DOI :
10.1109/ICIPS.1997.669366