Discriminatively estimated joint acoustic, duration, and language model for speech recognition

Author

Lehr, Maider; Shafran, Izhak

Author_Institution

Center for Spoken Language Understanding (CSLU), Oregon Health & Science University, Portland, OR, USA

fYear

2010

fDate

14-19 March 2010

Firstpage

5542

Lastpage

5545

Abstract

We introduce a discriminative model for speech recognition that integrates acoustic, duration, and language components. In the framework of finite state machines, a general model for speech recognition G is a finite state transduction from acoustic state sequences to word sequences (e.g., the search graph in many speech recognizers). The lattices from a baseline recognizer can be viewed as an a posteriori version of G after having observed an utterance. So far, discriminative language models have been proposed to correct the output side of G and are applied to the lattices. The acoustic state sequences on the input side of these lattices can also be exploited to improve the choice of the best hypothesis through the lattice. Taking this view, the model proposed in this paper jointly estimates the parameters for acoustic and language components in a discriminative setting. The resulting model can be factored as corrections for the input and the output sides of the general model G. This formulation allows us to incorporate duration cues seamlessly. Empirical results on a large vocabulary Arabic GALE task demonstrate that the proposed model improves word error rate substantially, with a gain of 1.6% absolute. Through a series of experiments, we analyze the contributions from and interactions between acoustic, duration, and language components, and find that duration cues play an important role in the Arabic task.
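
As a rough illustration of the idea described in the abstract, the sketch below reranks an n-best list (a simplification of a lattice) with a single linear model whose features come from both the input side (acoustic state labels and their durations) and the output side (word n-grams). The abstract does not specify the estimation algorithm or the feature set, so the perceptron update, the feature templates, the duration clipping, and every name here (`features`, `score`, `rerank`, `train_perceptron`, and the dictionary keys) are assumptions made for illustration only.

```python
from collections import Counter


def features(hyp):
    """Map one hypothesis to sparse joint features (hypothetical templates).

    hyp is an assumed dict layout with:
      "words":  list of word strings (output side)
      "states": list of (state_label, duration_in_frames) pairs (input side)
    """
    feats = Counter()
    # Output side: word unigrams and bigrams.
    words = ["<s>"] + hyp["words"] + ["</s>"]
    for w in words[1:]:
        feats[("word", w)] += 1
    for a, b in zip(words, words[1:]):
        feats[("word_bigram", a, b)] += 1
    # Input side: state unigrams, state bigrams, and clipped durations.
    states = hyp["states"]
    for label, dur in states:
        feats[("state", label)] += 1
        feats[("dur", label, min(dur, 10))] += 1  # clip at 10 frames (assumption)
    for (a, _), (b, _) in zip(states, states[1:]):
        feats[("state_bigram", a, b)] += 1
    return feats


def score(weights, hyp):
    """Baseline recognizer score plus the learned linear correction."""
    correction = sum(weights.get(f, 0.0) * v for f, v in features(hyp).items())
    return hyp["baseline"] + correction


def rerank(weights, nbest):
    """Return the highest-scoring hypothesis in the list."""
    return max(nbest, key=lambda h: score(weights, h))


def train_perceptron(data, epochs=5):
    """Structured perceptron over n-best lists.

    data is a list of n-best lists; each hypothesis carries an "errors"
    count against the reference, used to pick the oracle hypothesis.
    """
    weights = {}
    for _ in range(epochs):
        for nbest in data:
            oracle = min(nbest, key=lambda h: h["errors"])
            best = rerank(weights, nbest)
            if best is not oracle:
                # Move weights toward the oracle and away from the current best.
                for f, v in features(oracle).items():
                    weights[f] = weights.get(f, 0.0) + v
                for f, v in features(best).items():
                    weights[f] = weights.get(f, 0.0) - v
    return weights
```

Because word, state, and duration features share one weight vector and one update, the acoustic-side and language-side corrections are estimated jointly rather than one after the other, which is the point this sketch is meant to convey.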

Keywords

linguistics; speech recognition; acoustic modeling; acoustic state sequences; discriminative language model; duration cues; duration modeling; finite state transduction; language modeling; large vocabulary Arabic GALE task; Automata; Decoding; Error analysis; Lattices; Natural languages; Parameter estimation; Performance gain; Vectors; Vocabulary; discriminative modeling

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on

Conference_Location

Dallas, TX

ISSN

1520-6149

Print_ISBN

978-1-4244-4295-9

Electronic_ISBN

1520-6149

Type

conf

DOI

10.1109/ICASSP.2010.5495227

Filename

5495227