Title :
Stream-weighted HMM for audio-visual ASR: a study on connected digit recognition
Author :
Chan, Michael T.
Author_Institution :
Rockwell Sci. Co., Thousand Oaks, CA, USA
Abstract :
We present new results on connected digit recognition in noisy environments using audio-visual speech recognition. We derive hybrid (geometric- and appearance-based) visual lip features with a real-time lip-tracking algorithm that we proposed previously. Using a single-speaker corpus modeled after the TIDIGITS database, we build whole-word HMMs with both single-stream and 2-stream modeling strategies. For the 2-stream HMM, we use stream-dependent weights to adjust the relative contributions of the two feature streams according to the acoustic SNR level. The 2-stream HMM consistently gave the lowest word error rate (WER), with an 83% error reduction at the -3 dB SNR level compared to the acoustic-only baseline. A visual-only ASR WER of 6.85% was also achieved, demonstrating the effectiveness of the visual features. A real-time system prototype was developed to demonstrate the concept.
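As background for the stream weighting described above, the sketch below illustrates the standard multi-stream HMM combination, in which the audio and visual emission likelihoods are raised to stream exponents, i.e. a weighted sum in the log domain. The linear SNR-to-weight mapping, the cutoff values, and the function name stream_weighted_log_likelihood are illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

def stream_weighted_log_likelihood(log_b_audio, log_b_visual, snr_db,
                                    snr_low=-3.0, snr_high=20.0):
    """Combine per-state audio and visual emission log-likelihoods with an
    SNR-dependent audio stream weight (hypothetical linear mapping).

    log_b_audio, log_b_visual: arrays of per-state emission log-likelihoods.
    snr_db: estimated acoustic SNR in dB.
    """
    # Map the SNR onto an audio weight in [0, 1]; the visual weight is its
    # complement, so the two exponents sum to one (a common convention).
    lam_a = float(np.clip((snr_db - snr_low) / (snr_high - snr_low), 0.0, 1.0))
    lam_v = 1.0 - lam_a
    # Standard multi-stream combination: b_j = b_j_audio**lam_a * b_j_visual**lam_v,
    # computed here in the log domain as a weighted sum.
    return lam_a * np.asarray(log_b_audio) + lam_v * np.asarray(log_b_visual)

# Example: at low SNR the visual stream dominates the combined score.
print(stream_weighted_log_likelihood([-10.0, -12.0], [-8.0, -9.0], snr_db=-3.0))
```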
Keywords :
audio-visual systems; error statistics; feature extraction; gesture recognition; hidden Markov models; sensor fusion; speech recognition; video signal processing; 2-stream HMM; TIDIGITS database; acoustic SNR level; audio-visual ASR; audio-visual speech recognition; automatic speech recognition; connected digit recognition; hybrid visual lip features; noisy environments; real-time lip-tracking; single-speaker corpus; stream-weighted HMM; word error rate; active shape model; deformable models; lips; streaming media; tracking; working environment noise
Conference_Title :
2002 IEEE Workshop on Multimedia Signal Processing
Print_ISBN :
0-7803-7713-3
DOI :
10.1109/MMSP.2002.1203233