DocumentCode
2665243
Title
Automatic word spacing using syllable n-gram and word statistics
Author
Kang, Mi-Young ; Choi, Sung-Ja ; Heo, Hee-Keun ; Lim, Sung-Shin ; Kwon, Hyuk-Chul
Author_Institution
Sch. of Electr. & Comput. Eng., Pusan Nat. Univ., South Korea
fYear
2003
fDate
26-29 Oct. 2003
Firstpage
419
Lastpage
424
Abstract
In this study, we propose an automatic word spacing system for the Korean language that uses syllable n-gram and word statistics extracted from a large amount of processed corpora. The optimal spacing points of a sentence are decided mainly by the Viterbi algorithm. Because the performance of statistical approaches is sensitive to the training corpus and suffers from the data sparseness problem, we enlarged the training corpora, used parameters found by examining test data, and proposed a method for adjusting the 'longest match strategy' based on the viable prefix. These measures increase the system's accuracy. Our corpora, covering various language registers, comprise 33,643,884 words. The pilot test was conducted with test data derived from different sources. On average, 94.24% precision in word-unit correction was obtained on the spacing test data.
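The abstract does not spell out the model, so the following is only a minimal sketch of how Viterbi decoding over syllable bigram statistics could choose spacing points. Everything in it is an assumption made for illustration: the function names `train` and `insert_spaces`, the add-one smoothing, the transition probabilities in `trans`, and the toy Latin-letter "syllables" standing in for Korean syllables. It does not include the word statistics or the viable-prefix adjustment of the longest match strategy mentioned above.

```python
import math


def train(spaced_sentences):
    """Count, for each adjacent syllable pair, how often a space separates it.

    Returns a dict mapping (left, right) -> [no_space_count, space_count].
    Each character is treated as one syllable (a simplifying assumption).
    """
    counts = {}
    for sentence in spaced_sentences:
        syls, tags = [], []  # tags[i] == 1 means a space follows syls[i]
        for word in sentence.split():
            for k, ch in enumerate(word):
                syls.append(ch)
                tags.append(1 if k == len(word) - 1 else 0)
        for i in range(len(syls) - 1):
            pair = counts.setdefault((syls[i], syls[i + 1]), [0, 0])
            pair[tags[i]] += 1
    return counts


def insert_spaces(text, counts, trans=((0.7, 0.3), (0.9, 0.1))):
    """Choose the most probable space/no-space tag for every syllable boundary.

    trans[p][t] is a hypothetical probability of tag t following tag p
    (0 = no space, 1 = space); the values are illustrative, not from the paper.
    """
    syls = list(text)
    n = len(syls) - 1  # number of boundaries between syllables
    if n <= 0:
        return text

    def emit(i, t):
        # Add-one smoothed estimate of P(tag t | syllable pair at boundary i).
        c = counts.get((syls[i], syls[i + 1]), [0, 0])
        return math.log((c[t] + 1) / (c[0] + c[1] + 2))

    score = [emit(0, 0), emit(0, 1)]  # best log-probability ending in each tag
    back = []                         # back-pointers, one pair per later boundary
    for i in range(1, n):
        new_score, ptr = [], []
        for t in (0, 1):
            cands = [score[p] + math.log(trans[p][t]) for p in (0, 1)]
            best_prev = 0 if cands[0] >= cands[1] else 1
            new_score.append(cands[best_prev] + emit(i, t))
            ptr.append(best_prev)
        score = new_score
        back.append(ptr)

    # Trace the best path backwards from the final boundary.
    t = 0 if score[0] >= score[1] else 1
    tags = [t]
    for ptr in reversed(back):
        t = ptr[t]
        tags.append(t)
    tags.reverse()

    out = []
    for i, syl in enumerate(syls):
        out.append(syl)
        if i < n and tags[i] == 1:
            out.append(" ")
    return "".join(out)


counts = train(["ab cd", "ab cd ab"])
spaced = insert_spaces("abcd", counts)
```

With the toy training sentences above, `insert_spaces("abcd", counts)` places a single space between `b` and `c`. A real system would replace the character-level toy data with syllable statistics from a large corpus and would likely use longer n-grams.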
Keywords
linguistics; maximum likelihood estimation; natural languages; word processing; Korean language; Viterbi algorithm; automatic word spacing system; data sparseness problem; syllable n-gram statistics; training corpora; word statistics; Error correction; Frequency estimation; Information retrieval; Natural language processing; Speech synthesis; Statistical analysis; Statistics; Testing
fLanguage
English
Publisher
IEEE
Conference_Titel
Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
Conference_Location
Beijing, China
Print_ISBN
0-7803-7902-0
Type
conf
DOI
10.1109/NLPKE.2003.1275942
Filename
1275942