Regional Information Center for Science and Technology - Frame-dependent multi-stream reliability indicators for audio-visual speech recognition

DocumentCode :

1874092

Title :

Frame-dependent multi-stream reliability indicators for audio-visual speech recognition

Author :

Garg, Ashutosh ; Potamianos, Gerasimos ; Neti, Chalapathy ; Huang, Thomas S.

Author_Institution :

Beckman Inst., Illinois Univ., Urbana, IL, USA

Volume :

fYear :

2003

fDate :

6-9 July 2003

Abstract :

We investigate the use of local, frame-dependent reliability indicators of the audio and visual modalities, as a means of estimating stream exponents of multi-stream hidden Markov models for audio-visual automatic speech recognition. We consider two such indicators for each modality, defined as functions of the speech-class conditional observation probabilities of appropriate audio-only or visual-only classifiers. We subsequently map the four reliability indicators into the stream exponents of a state-synchronous, two-stream hidden Markov model, as a sigmoid function of their linear combination. We propose two algorithms to estimate the sigmoid weights, based on the maximum conditional likelihood and minimum classification error criteria. We demonstrate the superiority of the proposed approach on a connected-digit audio-visual speech recognition task, under varying audio channel noise conditions. Indeed, the use of estimated, frame-dependent stream exponents results in a significantly smaller word error rate than using global stream exponents. In addition, it outperforms utterance-level exponents, even though the latter utilize a priori knowledge of the utterance noise level.
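The mapping described in the abstract, from four reliability indicators to a stream exponent through a sigmoid of their linear combination, can be illustrated with a minimal Python sketch. Everything beyond that one sentence is an assumption made for illustration: the function names, the bias term, the constraint that the audio and visual exponents sum to one, and the example numbers are hypothetical and are not taken from the paper. The indicators are treated as given inputs, since the abstract does not define them, and the weights are fixed by hand rather than estimated with the maximum conditional likelihood or minimum classification error procedures.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def audio_exponent(indicators: Sequence[float],
                   weights: Sequence[float],
                   bias: float = 0.0) -> float:
    """Map the reliability indicators of one frame to an audio stream
    exponent in (0, 1): sigmoid of their linear combination.
    The bias term is an assumption of this sketch."""
    if len(indicators) != len(weights):
        raise ValueError("indicators and weights must have equal length")
    linear = bias + sum(w * d for w, d in zip(weights, indicators))
    return sigmoid(linear)


def combined_log_likelihood(log_p_audio: float,
                            log_p_visual: float,
                            lam_audio: float) -> float:
    """Two-stream score for one frame and one HMM state.
    Assumes the exponents sum to one, so the visual exponent is
    1 - lam_audio (a common convention, assumed here)."""
    return lam_audio * log_p_audio + (1.0 - lam_audio) * log_p_visual


# Hypothetical frame: two audio and two visual indicators, hand-picked weights.
frame_indicators = [1.2, 0.4, 0.3, 0.1]
sigmoid_weights = [0.8, 0.5, -0.6, -0.3]
lam = audio_exponent(frame_indicators, sigmoid_weights)
score = combined_log_likelihood(-2.0, -4.0, lam)
```

Because the exponent is recomputed for every frame, a noisy audio frame can shift weight toward the visual stream without changing the exponents used for the rest of the utterance.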

Keywords :

hidden Markov models; maximum likelihood estimation; reliability; speech recognition; HMM; a-priori knowledge; audio channel noise conditions; audio-visual speech recognition; frame-dependent multistream reliability indicators; global stream exponents; hidden Markov models; maximum conditional likelihood; minimum classification error criteria; sigmoid function; state-synchronous; utterance noise level; Automatic speech recognition; Degradation; Error analysis; Hidden Markov models; Humans; Neural networks; Noise level; Robustness; Speech recognition; Streaming media;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Multimedia and Expo, 2003. ICME '03. Proceedings. 2003 International Conference on

Print_ISBN :

0-7803-7965-9

Type :

conf

DOI :

10.1109/ICME.2003.1221384

Filename :

1221384

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1874092