Improving n-gram probability estimates by compound-head clustering

Author

Pelemans, Joris ; Demuynck, Kris ; Van hamme, Hugo ; Wambacq, Patrick

Author_Institution

Dept. ESAT, Katholieke Univ. Leuven, Leuven, Belgium

fYear

2015

fDate

19-24 April 2015

Firstpage

5221

Lastpage

5225

Abstract

Compounding is one of the most productive word formation processes in many languages and is therefore a main source of data sparsity in language modeling. Many solutions have been suggested to model compound words, most of which break the compound into its constituents and train a new model with them. In earlier work, we argued that this approach is suboptimal and we presented a novel technique that clusters new, domain-specific compound words together with their semantic heads. The clusters were then used to build a class-based n-gram model that enabled a reliable estimation of n-gram probabilities, without the need for additional training data. In this paper, we investigate how this “semantic head mapping” can best be made an integral part of the language modeling strategy and find that, with some adaptations, our technique is capable of producing more accurate compound probability estimates than a baseline word-based n-gram language model, and that these estimates lead to a significant word error rate reduction for Dutch read speech.

Keywords

pattern clustering; probability; semantic networks; speech recognition; Dutch read speech; class-based n-gram model; compound probability estimates; compounding; data sparsity; domain-specific compound words; language modeling; n-gram probabilities; semantic head mapping; word error rate reduction; Estimation; Pragmatics; Semantics; Training data; LVCSR; language models; n-grams; word clusters

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on

Conference_Location

South Brisbane, QLD

Type

conf

DOI

10.1109/ICASSP.2015.7178967

Filename

7178967