Title :
Study on Good-Turing and a novel smoothing method based on real corpora for language models
Author :
Feng-Long, Huang ; Ming-Shing, Yu
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. United Univ., Taiwan
Abstract :
In this paper, we study the well-known Good-Turing smoothing technique, which has inherent issues, and propose a novel method. Smoothing methods are used to resolve the zero-count problem in traditional language models. Basically, smoothing techniques involve two processes: discounting and redistributing. The statistical behavior of the smoothing method is analyzed across training data of various sizes. Several features of Good-Turing are used to improve the method. We propose novel smoothing methods based on real training data sets, which reflect the actual behavior of the smoothing method. In contrast to the heuristic probability assignment used by most smoothing methods, our method is based on real training corpora. Curves of the number of events at each count can be plotted. The behavior of our method is analyzed, and empirical results are presented to compare the effectiveness of the smoothing methods.
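For readers unfamiliar with the two processes named in the abstract, the following is a minimal sketch of textbook Good-Turing discounting and redistribution, not the novel method proposed in the paper. The function name `good_turing`, the toy counts, and the fallback to the raw count when no events have count r + 1 are all assumptions made for illustration.

```python
from collections import Counter


def good_turing(counts):
    """Return (unseen_mass, adjusted) for a dict mapping event -> raw count."""
    total = sum(counts.values())             # N: total number of observations
    freq_of_freq = Counter(counts.values())  # N_r: number of events seen exactly r times

    # Redistribution: probability mass reserved for unseen events, N_1 / N.
    unseen_mass = freq_of_freq[1] / total

    adjusted = {}
    for event, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq[r + 1]
        # Discounting: r* = (r + 1) * N_{r+1} / N_r.
        # Assumption: keep the raw count when N_{r+1} is zero.
        adjusted[event] = (r + 1) * n_r1 / n_r if n_r1 > 0 else float(r)
    return unseen_mass, adjusted


# Hypothetical toy counts: three events seen once, two seen twice, one seen three times.
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
unseen_mass, adjusted = good_turing(counts)
```

The fallback branch shows where plain Good-Turing becomes unreliable: for large counts, N_{r+1} is often zero or noisy, so practical variants smooth the N_r curve before applying the formula.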
Keywords :
natural languages; smoothing methods; statistical analysis; Good-Turing smoothing method; heuristic probability assignment; language models; real training data sets; statistical behavior; zero count problems; Computer science; Entropy; Maximum likelihood estimation; Natural language processing; Natural languages; Predictive models; Probability; Smoothing methods; Stochastic processes; Training data;
Conference_Title :
Systems, Man and Cybernetics, 2004 IEEE International Conference on
Print_ISBN :
0-7803-8566-7
DOI :
10.1109/ICSMC.2004.1400926