DocumentCode :
3443450
Title :
Lossless compression of language model structure and word identifiers
Author :
Raj, B. ; Whittaker, E.W.D.
Author_Institution :
Mitsubishi Electr. Res. Lab., Cambridge, MA, USA
Volume :
1
fYear :
2003
fDate :
6-10 April 2003
Abstract :
Very large reductions in language model memory requirements have recently been reported for large-vocabulary continuous speech recognition applications through the pruning and quantization of the floating-point components of the language model: the probabilities and back-off weights. In this paper, that work is extended through the compression of the integer components: the word identifiers and storage structures. A novel algorithm is presented for converting ordered lists of monotonically increasing integer values (such as are commonly found in language models) into variable-bit-width tree structures such that the most memory-efficient configuration is obtained for each original list. By applying this new technique together with the techniques reported previously, we obtain an 86% reduction in language model size, to 10Mb, with no increase in word error rate on the DARPA Hub4 1998 task and a 0.5% absolute increase on the Hub4 1997 task.
Keywords :
data compression; digital storage; grammars; natural languages; probability; quantisation (signal); speech coding; speech recognition; DARPA Hub4 1998 task; Hub4 1997 task; N-gram probabilities; back-off weights; floating-point components pruning; floating-point components quantization; integer components compression; language model memory reduction; language model size; language model structure; large vocabulary continuous speech recognition; lossless compression; memory efficient configuration; probabilities; storage structures; variable-bit width tree structures; word error rate; word identifiers; Degradation; Error analysis; Intrusion detection; Laboratories; Natural languages; Personal communication networks; Quantization; Speech recognition; Tree data structures; Vocabulary;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on
ISSN :
1520-6149
Print_ISBN :
0-7803-7663-3
Type :
conf
DOI :
10.1109/ICASSP.2003.1198799
Filename :
1198799