DocumentCode
428585
Title
Study on Good-Turing and a novel smoothing method based on real corpora for language models
Author
Feng-Long Huang; Ming-Shing Yu
Author_Institution
Dept. of Comput. Sci. & Inf. Eng., Nat. United Univ., Taiwan
Volume
4
fYear
2004
fDate
10-13 Oct. 2004
Firstpage
3741
Abstract
In this paper, we study the well-known Good-Turing smoothing technique, which has an inherent issue, and propose a novel method. Smoothing methods are used to resolve the zero-count problem in traditional language models. Basically, a smoothing technique involves two processes: discounting and redistributing. We analyze the statistical behavior of smoothing methods on training data of various sizes, and several features of Good-Turing are used to improve the method. We propose novel smoothing methods based on real training data sets, which reflect the physical behavior of smoothing. In contrast to the heuristic probability assignment used by most smoothing methods, our method is based on real training corpora. The curves of the number of events with each count can be plotted, and the behavior of our method is analyzed. Empirical results are presented to compare the effectiveness of the smoothing methods.
Keywords
natural languages; smoothing methods; statistical analysis; Good-Turing smoothing method; heuristic probability assignment; language models; real training data sets; statistical behavior; zero count problems; Computer science; Entropy; Maximum likelihood estimation; Natural language processing; Natural languages; Predictive models; Probability; Smoothing methods; Stochastic processes; Training data;
fLanguage
English
Publisher
ieee
Conference_Titel
2004 IEEE International Conference on Systems, Man and Cybernetics
ISSN
1062-922X
Print_ISBN
0-7803-8566-7
Type
conf
DOI
10.1109/ICSMC.2004.1400926
Filename
1400926
Link To Document