A training procedure for a segment-based-network approach to isolated word recognition

Author

Soong, F.K.

Author_Institution

AT&T Bell Laboratories, Murray Hill, New Jersey

Volume

12

fYear

1987

fDate

April 1987

Firstpage

693

Lastpage

696

Abstract

In this paper, we propose a complete training procedure for creating a subword-based network and test it in an isolated word recognition experiment. We first hand-segment one training token per word into contiguous subword segments with the aid of an interactive program that can display and play back various acoustic features of an utterance. The subword segmental units adopted in this paper fall into four sound classes: stationary sounds, fast transitional sounds, slow transitional sounds, and consonant clusters and others. The hand-segmented token is used to initialize a subword-based word network, which is then refined by using more training tokens. The refinement is carried out with a two-level dynamic programming (DP) procedure. At the first level, or the word level, an endpoint-relaxed DP algorithm is used to remove any possible endpointing errors and to mark tentative segment boundaries. Between the marked segment boundaries, another endpoint-relaxed DP algorithm is employed at the segment level to refine the segments extracted at the word level. A segment-based word network, which consists of serial and parallel branches, is generated from this training procedure. While serial branches are generated by using acoustically similar segments aligned at the segment level, parallel branches are created to accommodate different acoustic manifestations of the same sound class in different phonetic contexts or different pronunciations. A speaker-dependent, isolated-word recognition experiment was carried out. For a four-speaker (2 male and 2 female) English alphabet database, the segment-based network gives improved performance over a conventional word-template-based approach. The word error rate is reduced from 11.2% for the word-based recognizer to 7.7% for the network-based recognizer; correspondingly, the number of misrecognized words is reduced from 116 to 80 out of 1040 recognition trials.
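The endpoint-relaxed DP alignment mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the local path constraints, the frame distance, or the width of the relaxation region, so the symmetric three-step recursion, the Euclidean distance, and the names `frame_dist`, `endpoint_relaxed_dtw`, and `relax` are all assumptions made for illustration. The sketch aligns a full reference template against a test token whose first and last frames may each be skipped by up to `relax` frames, which is one common way to absorb endpointing errors.

```python
import math


def frame_dist(a, b):
    """Euclidean distance between two feature frames (an assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def endpoint_relaxed_dtw(ref, test, relax=2):
    """Align all of `ref` to `test`, letting the path start in the first
    relax+1 test frames and end in the last relax+1 test frames.

    Returns (accumulated_distance, start_frame, end_frame), where the
    frame indices refer to `test`. Hypothetical helper, for illustration.
    """
    n, m = len(ref), len(test)
    if n == 0 or m == 0:
        raise ValueError("ref and test must be non-empty")
    relax = max(0, min(relax, m - 1))
    inf = float("inf")
    # cell[i][j] = (best accumulated distance, start frame of that path)
    cell = [[(inf, -1)] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            candidates = []
            if i == 0 and j <= relax:
                candidates.append((0.0, j))            # fresh start at frame j
            if j > 0:
                candidates.append(cell[i][j - 1])      # advance in test only
            if i > 0:
                candidates.append(cell[i - 1][j])      # advance in ref only
            if i > 0 and j > 0:
                candidates.append(cell[i - 1][j - 1])  # advance in both
            best_cost, best_start = min(candidates)
            if best_cost < inf:
                cell[i][j] = (best_cost + frame_dist(ref[i], test[j]), best_start)
    # Relaxed ending: pick the best of the last relax+1 test frames.
    end = min(range(m - 1 - relax, m), key=lambda j: cell[n - 1][j][0])
    cost, start = cell[n - 1][end]
    return cost, start, end


# Toy example: the test token has one spurious frame at each end.
ref = [[0.0], [1.0], [2.0]]
test = [[9.0], [0.0], [1.0], [2.0], [9.0]]
relaxed = endpoint_relaxed_dtw(ref, test, relax=1)
fixed = endpoint_relaxed_dtw(ref, test, relax=0)
```

In this toy case the relaxed alignment skips the spurious frames at both ends of the test token, while `relax=0` forces the path to include them and therefore accumulates a larger distance. A two-level procedure in the spirit of the abstract could apply the same routine first to whole words and then again between the tentative segment boundaries, although the exact procedure is not given in the abstract.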

Keywords

Acoustic testing; Clustering algorithms; Computational complexity; Displays; Dynamic programming; Error analysis; Speech recognition; Switches; Vocabulary;

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '87.

Type

conf

DOI

10.1109/ICASSP.1987.1169579

Filename

1169579