Title :
Impact of training corpus size on the quality of different types of language models for Serbian
Author :
Ostrogonac, Stevan; Secujski, Milan; Miskovic, Dragisa
Author_Institution :
Fac. of Tech. Sci., Univ. of Novi Sad, Novi Sad, Serbia
Abstract :
This paper describes a study of the correspondence between language model quality and the size of the textual corpus used for training. Three types of n-gram models developed for the Serbian language were included in the study: word-based, lemma-based and class-based models. They were created to deal with the data sparsity problem, which is particularly pronounced due to the high degree of inflection in Serbian. The three model types were trained on corpora of different sizes and evaluated by perplexity on authentic text and on text with randomized word order, in order to obtain discrimination coefficient values. These values reveal different degrees of robustness of the three model types to the data sparsity problem and indicate how the models can be combined to achieve the best language representation for a given training corpus.
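The following is a minimal sketch (not taken from the paper) of the evaluation procedure the abstract describes: a simple bigram model with add-one smoothing is scored by perplexity on authentic text and on the same text with shuffled word order. The "discrimination coefficient" is assumed here to be the ratio of the two perplexities; the paper's exact definition, model types and smoothing method may differ.

    import math
    import random
    from collections import Counter

    def train_bigram(tokens):
        # Bigram model with add-one (Laplace) smoothing; a stand-in for the
        # word-, lemma- or class-based n-gram models studied in the paper.
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        vocab_size = len(unigrams)

        def prob(prev, word):
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

        return prob

    def perplexity(prob, tokens):
        # Perplexity over the bigrams of the evaluation text.
        log_sum = sum(math.log2(prob(p, w)) for p, w in zip(tokens, tokens[1:]))
        return 2 ** (-log_sum / (len(tokens) - 1))

    # Toy training and evaluation data (placeholders, not the paper's corpora).
    train_tokens = "ovo je primer teksta za obuku jezickog modela".split()
    test_tokens = "ovo je primer teksta".split()

    model = train_bigram(train_tokens)
    ppl_authentic = perplexity(model, test_tokens)

    shuffled = test_tokens[:]
    random.shuffle(shuffled)
    ppl_shuffled = perplexity(model, shuffled)

    # Assumed definition: a higher ratio means the model discriminates better
    # between authentic and randomly ordered text.
    discrimination = ppl_shuffled / ppl_authentic
    print(ppl_authentic, ppl_shuffled, discrimination)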
Keywords :
natural languages; speech recognition; training; Serbian language; class-based model; data sparsity problem; language model quality; language models; language representation; lemma-based model; n-gram models; textual corpus; training corpus size; word-based model; Data models; Electronic mail; Equations; Mathematical model; Training; Training data; Vocabulary; Language model; discrimination coefficient; evaluation; perplexity;
Conference_Title :
2012 20th Telecommunications Forum (TELFOR)
Conference_Location :
Belgrade
Print_ISBN :
978-1-4673-2983-5
DOI :
10.1109/TELFOR.2012.6419309