• DocumentCode
    2665243
  • Title

    Automatic word spacing using syllable n-gram and word statistics

  • Author

    Kang, Mi-Young ; Choi, Sung-Ja ; Heo, Hee-Keun ; Lim, Sung-Shin ; Kwon, Hyuk-Chul

  • Author_Institution
    Sch. of Electr. & Comput. Eng., Pusan Nat. Univ., South Korea
  • fYear
    2003
  • fDate
    26-29 Oct. 2003
  • Firstpage
    419
  • Lastpage
    424
  • Abstract
    In this study, we propose an automatic word spacing system for the Korean language that uses syllable n-gram and word statistics extracted from a large amount of processed corpora. The optimal spacing points of a sentence are decided mainly by using the Viterbi algorithm. Because the performance of statistical approaches is sensitive to the training corpus and suffers from the data sparseness problem, we enlarged the training corpora, used parameters found by examining test data, and proposed a method for adjusting the 'longest match strategy' based on the viable prefix. These measures increase the system's accuracy. Our corpora, covering various language registers, comprised 33,643,884 words. The pilot test was conducted with test data derived from different sources. On average, 94.24% precision in word-unit correction was obtained on the spacing test data.
  • Keywords
    linguistics; maximum likelihood estimation; natural languages; word processing; Korean language; Viterbi algorithm; automatic word spacing system; data sparseness problem; syllable n-gram statistics; training corpora; word statistics; Error correction; Frequency estimation; Information retrieval; Natural language processing; Natural languages; Speech synthesis; Statistical analysis; Statistics; Testing; Viterbi algorithm;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
  • Conference_Location
    Beijing, China
  • Print_ISBN
    0-7803-7902-0
  • Type

    conf

  • DOI
    10.1109/NLPKE.2003.1275942
  • Filename
    1275942
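
The abstract's central idea, choosing spacing points with the Viterbi algorithm over syllable n-gram statistics, can be illustrated with a minimal sketch. This is not the authors' system: it assumes syllable bigrams only, treats each character as one syllable (which holds for precomposed Hangul), uses add-one smoothing, and omits the word statistics and the adjusted longest match strategy. The names `train`, `space`, and `_logp` are hypothetical.

```python
import math
from collections import defaultdict

def train(spaced_sentences):
    """Count spacing decisions in correctly spaced text.

    Tag 1 means "a space follows this syllable"; tag 0 means "no space".
    """
    pair_counts = defaultdict(lambda: [0, 0])  # (syl, next_syl) -> [n_tag0, n_tag1]
    trans_counts = [[0, 0], [0, 0]]            # trans_counts[prev_tag][tag]
    for sentence in spaced_sentences:
        syllables, tags = [], []
        for ch in sentence.strip():
            if ch == " ":
                if tags:
                    tags[-1] = 1
            else:
                syllables.append(ch)
                tags.append(0)
        prev = None
        # Only gaps between two adjacent syllables carry a decision.
        for i in range(len(syllables) - 1):
            pair_counts[(syllables[i], syllables[i + 1])][tags[i]] += 1
            if prev is not None:
                trans_counts[prev][tags[i]] += 1
            prev = tags[i]
    return pair_counts, trans_counts

def _logp(counts, tag):
    # Add-one smoothing keeps unseen events at a non-zero probability.
    return math.log((counts[tag] + 1) / (counts[0] + counts[1] + 2))

def space(text, model):
    """Re-space `text` with the tag sequence that maximizes the Viterbi score."""
    pair_counts, trans_counts = model
    syllables = [ch for ch in text if ch != " "]
    n = len(syllables) - 1                     # number of spacing decisions
    if n < 1:
        return "".join(syllables)
    first = pair_counts.get((syllables[0], syllables[1]), (0, 0))
    best = [_logp(first, 0), _logp(first, 1)]  # best score ending in each tag
    back = []                                  # back-pointers per position
    for i in range(1, n):
        counts = pair_counts.get((syllables[i], syllables[i + 1]), (0, 0))
        new_best, pointers = [], []
        for tag in (0, 1):
            prev = max((0, 1), key=lambda p: best[p] + _logp(trans_counts[p], tag))
            new_best.append(best[prev] + _logp(trans_counts[prev], tag)
                            + _logp(counts, tag))
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the final position.
    tag = 0 if best[0] >= best[1] else 1
    tags = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        tags.append(tag)
    tags.reverse()
    out = []
    for syl, t in zip(syllables, tags + [0]):  # no space after the last syllable
        out.append(syl)
        if t == 1:
            out.append(" ")
    return "".join(out)

# Toy usage with Latin letters standing in for syllables.
model = train(["ab cd", "ab cd", "cd ab"])
print(space("abcd", model))  # prints "ab cd"
```

With a real corpus, the training input would be correctly spaced Korean sentences. Data sparseness, which the abstract names as a key difficulty, would show up here as many syllable pairs with zero counts, for which the smoothed estimate falls back to an even split.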