• DocumentCode
    260391
  • Title

    Comparison of Modified Kneser-Ney and Witten-Bell smoothing techniques in statistical language model of Bahasa Indonesia

  • Author

    Ismail

  • Author_Institution
    Comput. Eng. Dept., Telkom Univ., Bandung, Indonesia
  • fYear
    2014
  • fDate
    28-30 May 2014
  • Firstpage
    409
  • Lastpage
    412
  • Abstract
    Smoothing is one technique for overcoming data sparsity in statistical language models. Although its mathematical definition has no explicit dependency on any specific natural language, the differing natures of natural languages cause smoothing techniques to have different effects. This is true for the Russian language, as shown by Whittaker [2]. In this paper, we compare the Modified Kneser-Ney and Witten-Bell smoothing techniques in a statistical language model of Bahasa Indonesia. We used training sets totaling 22M words, extracted from the Indonesian version of Wikipedia. As far as we know, this is the largest training set used to build a statistical language model for Bahasa Indonesia. Experiments with 3-gram, 5-gram, and 7-gram models showed that Modified Kneser-Ney consistently outperforms Witten-Bell smoothing in terms of perplexity. Interestingly, with Modified Kneser-Ney smoothing the 5-gram model outperforms the 7-gram model, whereas Witten-Bell smoothing improves consistently as the n-gram order increases.
  • Keywords
    computational linguistics; natural language processing; statistical analysis; 3-gram model; 5-gram model; 7-gram model; Bahasa Indonesia; Russian language; Wikipedia; Witten-Bell smoothing techniques; data sparsity; modified Kneser-Ney smoothing techniques; n-gram order; natural language; statistical language model; Computational modeling; Internet; Mathematical model; Natural languages; Smoothing methods; Standards; Training; Kneser-Ney; Witten-Bell; n-gram; smoothing technique; statistical language model of Bahasa Indonesia;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information and Communication Technology (ICoICT), 2014 2nd International Conference on
  • Conference_Location
    Bandung
  • Type

    conf

  • DOI
    10.1109/ICoICT.2014.6914097
  • Filename
    6914097