DocumentCode :
1376485
Title :
Audio-visual speech modeling for continuous speech recognition
Author :
Dupont, Stéphane ; Luettin, Juergen
Author_Institution :
TCTS Lab., Mons Polytech. Inst., Belgium
Volume :
2
Issue :
3
fYear :
2000
fDate :
September 1, 2000
Firstpage :
141
Lastpage :
151
Abstract :
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module, an acoustic module, and a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration and hence enables the modeling of temporal dependencies more accurately than traditional approaches. We present two different methods to learn the asynchrony between the two modalities and show how to incorporate it in the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (relative spectra) acoustic features to a 7.2% error rate, and combined noise-robust acoustic and visual features to a 2.5% error rate.
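Note: the sensor fusion described above rests on the standard multistream HMM emission model, in which each state scores the acoustic and visual observations with separate stream densities and combines them via stream exponents, b_j(o_a, o_v) = b_j^a(o_a)^{w_a} * b_j^v(o_v)^{w_v}, i.e. a weighted sum in the log domain. The following Python sketch illustrates only this weighted combination step; the function name and fixed weights are assumptions for illustration and do not reproduce the paper's actual topologies or its learned asynchrony modeling.

import numpy as np

def combined_log_likelihood(log_b_audio, log_b_video, w_audio=0.7, w_video=0.3):
    """Combine per-state stream log-likelihoods of a multistream HMM.

    Implements log b_j(o_a, o_v) = w_a * log b_j^a(o_a) + w_v * log b_j^v(o_v).
    The weights here are fixed for illustration; in practice they are tuned
    (e.g. to the acoustic SNR) or learned.
    """
    return w_audio * np.asarray(log_b_audio) + w_video * np.asarray(log_b_video)

# Example: three HMM states scored on one audio-visual frame pair.
log_b_a = np.array([-4.2, -2.1, -6.3])   # acoustic-stream log-likelihoods per state
log_b_v = np.array([-3.0, -5.5, -2.8])   # visual-stream log-likelihoods per state
print(combined_log_likelihood(log_b_a, log_b_v))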
Keywords :
face recognition; feature extraction; hidden Markov models; learning by example; multimedia systems; sensor fusion; speech recognition; acoustic module; acoustic perceptual linear prediction features; acoustic speech information; appearance-based lip model; audio-visual speech modeling; continuous speech recognition; continuously spoken digits; contour information; example based learning; example images; grey-level information; joint temporal modeling; large multispeaker database; lip movements; lipreading; mouth area; multistream hidden Markov models; noise robust RASTA-PLP; noise-robust features; noisy environments; recognition performance; relative spectra; sensor fusion module; signal-to-noise ratio; speech features; speechreading; temporal dependencies; visual module; visual speech features; visual speech information; Acoustic noise; Data mining; Error analysis; Feature extraction; Hidden Markov models; Noise robustness; Sensor fusion; Signal to noise ratio; Speech recognition; Streaming media;
fLanguage :
English
Journal_Title :
IEEE Transactions on Multimedia
Publisher :
IEEE
ISSN :
1520-9210
Type :
jour
DOI :
10.1109/6046.865479
Filename :
865479