Improving language modeling by using distance and co-occurrence information of word-pairs and its application to LVCSR

Author

Tze Yuang Chong ; Banchs, Rafael E. ; Eng Siong Chng ; Haizhou Li

Author_Institution

Temasek Labs., Nanyang Technol. Univ., Singapore, Singapore

fYear

2014

fDate

4-9 May 2014

Firstpage

4883

Lastpage

4887

Abstract

This paper reports our study on exploiting the distance and co-occurrence information of word-pairs to improve the n-gram language model. We used these two types of information to model the distant context, up to a history length of ten. We also show that the proposed model provides complementary information about the n-gram's context that the n-gram model cannot capture because of data scarcity. Evaluated on the WSJ and SWB-1 corpora, the proposed model reduced the trigram perplexity by up to 11.2% and 6.5%, respectively. In an N-best re-ranking task on the Aurora-4 database, our model helped a hexagram model achieve a ~9% relative improvement in WER.

Keywords

natural language processing; speech recognition; Aurora-4 database; LVCSR; SWB-1 corpora; WSJ corpora; data scarcity; hexagram model; n-gram language modeling improvement; natural language processing tasks; word-pairs co-occurrence information; word-pairs distance information; Adaptation models; Computational modeling; Context; Context modeling; Hidden Markov models; History; Term-distance; language model; term-occurrence

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on

Conference_Location

Florence

Type

conf

DOI

10.1109/ICASSP.2014.6854530

Filename

6854530