Split-lexicon based hierarchical recognition of speech using syllable and word level acoustic units

Author

Sethy, Abhinav ; Narayanan, Shrikanth

Author_Institution

Dept. of Electr. Eng. Syst., Univ. of Southern California, Los Angeles, CA, USA

Volume

1

fYear

2003

fDate

6-10 April 2003

Abstract

Most speech recognition systems, especially large vocabulary continuous speech recognition (LVCSR) systems, use context-dependent (CD) phones as the basic acoustic unit for recognition. The primary motive for this is the relative ease with which phone-based systems can be trained robustly with small amounts of data. However, as recent research indicates, significant improvements in recognition accuracy can be gained by using acoustic units of longer duration, such as syllables. Syllables and other longer units provide an efficient way of modeling long-term temporal dependencies in speech, which are difficult to capture in a phoneme-based recognition framework. But these longer-duration units suffer from the training data sparsity problem, since a large number of units in the lexicon will have little or no acoustic training data. In this paper we present a two-step approach to address the training data sparsity problem. First, we use CD phones to initialize the higher-level units in a manner that minimizes the impact of training data sparsity. Subsequently, we present methods to split the lexicon into units of different acoustic length based on an analysis of the training data. We present results showing that a 25-30% improvement in word error rate can be achieved by using CD phone initialization and variable-length unit selection on a medium-vocabulary continuous speech recognition task.

Keywords

error statistics; speech processing; speech recognition; vocabulary; CD phone initialization; CD phones; LVCSR; continuous speech recognition task; long term temporal dependencies; medium vocabulary; modeling; split-lexicon based hierarchical recognition; syllables; training data sparsity problem; two step approach; variable length unit selection; word error rate; word level acoustic units; Acoustical engineering; Automatic speech recognition; Context modeling; Error analysis; Feature extraction; Robustness; Speech analysis; Speech recognition; Training data; Vocabulary;

fLanguage

English

Publisher

IEEE

Conference_Title

Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on

ISSN

1520-6149

Print_ISBN

0-7803-7663-3

Type

conf

DOI

10.1109/ICASSP.2003.1198895

Filename

1198895