Title :
Impact of training corpus size on the quality of different types of language models for Serbian
Author :
Ostrogonac, Stevan; Secujski, Milan; Miskovic, Dragisa
Author_Institution :
Fac. of Tech. Sci., Univ. of Novi Sad, Novi Sad, Serbia
Abstract :
This paper describes a study of the correspondence between language model quality and the size of the textual corpus used for training. Three types of n-gram models developed for the Serbian language were included in the study: word-based, lemma-based and class-based models. They were created to deal with the data sparsity problem, which is particularly pronounced due to the high degree of inflection in Serbian. The three model types were trained on corpora of different sizes and evaluated by perplexity on authentic text and on text with randomized word order, in order to obtain discrimination coefficient values. These values reveal different degrees of robustness of the three model types to the data sparsity problem and indicate how the models can be combined to achieve the best language representation for a given training corpus.
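The following is a minimal sketch (not taken from the paper) of the evaluation procedure the abstract describes: a simple bigram model with add-one smoothing is scored by perplexity on authentic text and on the same text with shuffled word order. The "discrimination coefficient" is assumed here to be the ratio of the two perplexities; the paper's exact definition, model types and smoothing method may differ.

    import math
    import random
    from collections import Counter

    def train_bigram(tokens):
        # Bigram model with add-one (Laplace) smoothing; a stand-in for the
        # word-, lemma- or class-based n-gram models studied in the paper.
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        vocab_size = len(unigrams)

        def prob(prev, word):
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

        return prob

    def perplexity(prob, tokens):
        # Perplexity over the bigrams of the evaluation text.
        log_sum = sum(math.log2(prob(p, w)) for p, w in zip(tokens, tokens[1:]))
        return 2 ** (-log_sum / (len(tokens) - 1))

    # Toy training and evaluation data (placeholders, not the paper's corpora).
    train_tokens = "ovo je primer teksta za obuku jezickog modela".split()
    test_tokens = "ovo je primer teksta".split()

    model = train_bigram(train_tokens)
    ppl_authentic = perplexity(model, test_tokens)

    shuffled = test_tokens[:]
    random.shuffle(shuffled)
    ppl_shuffled = perplexity(model, shuffled)

    # Assumed definition: a higher ratio means the model discriminates better
    # between authentic and randomly ordered text.
    discrimination = ppl_shuffled / ppl_authentic
    print(ppl_authentic, ppl_shuffled, discrimination)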
Keywords :
natural languages; speech recognition; training; Serbian language; class-based model; data sparsity problem; language model quality; language models; language representation; lemma-based model; n-gram models; textual corpus; training corpus size; word-based model; Data models; Electronic mail; Equations; Mathematical model; Training; Training data; Vocabulary; Language model; discrimination coefficient; evaluation; perplexity;
Conference_Title :
2012 20th Telecommunications Forum (TELFOR)
Conference_Location :
Belgrade
Print_ISBN :
978-1-4673-2983-5
DOI :
10.1109/TELFOR.2012.6419309