Title :
Analysis of smoothing methods for language models on small Chinese corpora
Author :
Ming-Chun Liou ; Feng-Long Huang ; Ming-Shing Yu ; Yih-Jeng Lin
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. United Univ., Miaoli, Taiwan
Abstract :
Data sparseness is an inherent issue of statistical language models, and smoothing methods have been used to resolve the zero-count problem. Because the zero-count issue is more severe on small corpora, 20 Chinese language models were generated from CGW corpora ranging from 1M to 20M Chinese words. Five smoothing methods, including Good-Turing and advanced Good-Turing smoothing as well as our two proposed methods, are evaluated and analyzed under both inside testing and outside testing, showing how each alleviates data sparseness across the various model sizes. Among these methods, our proposed YH-B performs best on all models.
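The Good-Turing approach mentioned in the abstract reassigns probability mass from seen n-grams to unseen ones using the "frequency of frequencies" N_c (the number of n-gram types observed exactly c times). As a minimal illustrative sketch of the classic (unsmoothed) Good-Turing estimate, not of the paper's YH-B method or its advanced variant, the adjusted count is c* = (c+1)·N_{c+1}/N_c and the total mass reserved for unseen events is N_1/N:

```python
from collections import Counter

def good_turing_probs(counts):
    """Classic (unsmoothed) Good-Turing estimate.

    counts: dict mapping n-gram -> observed count c.
    Returns (probs, p_unseen): per-item probabilities based on the
    adjusted count c* = (c+1) * N_{c+1} / N_c, plus the total
    probability mass N_1 / N reserved for zero-count events.
    """
    N = sum(counts.values())           # total number of observations
    Nc = Counter(counts.values())      # frequency of frequencies: c -> N_c
    probs = {}
    for item, c in counts.items():
        if Nc.get(c + 1):              # c* is only defined when N_{c+1} > 0
            c_star = (c + 1) * Nc[c + 1] / Nc[c]
        else:
            c_star = c                 # fall back to the raw count
        probs[item] = c_star / N
    p_unseen = Nc.get(1, 0) / N        # mass for events never observed
    return probs, p_unseen
```

In practice N_{c+1} can be zero for large c, which is one motivation for the "advanced" variants the paper evaluates; the fallback to the raw count here is a simplifying assumption.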
Keywords :
natural language processing; CGW; data sparseness; YH-B; advanced Good-Turing smoothing; language models; small Chinese corpora; smoothing methods; statistical language models; Abstracts; Acoustics; Analytical models; Artificial intelligence; Entropy; Maximum likelihood estimation; Signal resolution; Cross Entropy; Language Models; Perplexity; Smoothing Methods;
Conference_Titel :
2014 International Conference on Machine Learning and Cybernetics (ICMLC)
Conference_Location :
Lanzhou
Print_ISBN :
978-1-4799-4216-9
DOI :
10.1109/ICMLC.2014.7009658