Title :
Comparing models for audiovisual fusion in a noisy-vowel recognition task
Author :
Teissier, Pascal ; Robert-Ribes, Jordi ; Schwartz, Jean-Luc ; Guérin-Dugué, Anne
Author_Institution :
Inst. de la Commun. Parlée, CNRS, Grenoble, France
fDate :
11/1/1999
Abstract :
Audiovisual speech recognition involves fusing the audio and video sensors for phonetic identification. There are three basic ways to fuse data streams in order to make a decision such as phoneme identification: data-to-decision, decision-to-decision, and data-to-data. This leads to four possible models for audiovisual speech recognition: direct identification in the first case, separate identification in the second, and two variants of the third, early-integration case, namely dominant recoding and motor recoding. However, no systematic comparison of these models is available in the literature. We propose an implementation of these four models and submit them to a benchmark test. To this end, we use a noisy-vowel corpus with two recognition paradigms in which the systems are tested at noise levels higher than those used for learning. In one of these paradigms, the signal-to-noise ratio (SNR) value is provided to the recognition systems; in the other, it is not. We also introduce a new criterion for evaluating performance, based on the information transmitted on individual phonetic features. In light of the compared performance of the four models under the two recognition paradigms, we discuss the advantages and drawbacks of these models, leading to proposals for data representation, fusion architecture, and control of the fusion process through sensor reliability.
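The abstract does not spell out how the criterion based on transmitted information is computed. As a minimal sketch, assuming it follows the usual confusion-matrix definition of transmitted (mutual) information between stimulus and response, the calculation for one phonetic feature could look like the code below. The function names, the toy confusion matrix, and the feature labelling are all hypothetical illustrations, not the authors' implementation.

```python
import math


def pool_by_feature(confusion, feature_values):
    """Collapse a vowel confusion matrix onto the values of one feature.

    confusion[i][j] counts how often vowel i was recognised as vowel j.
    feature_values[i] is the (hypothetical) feature value of vowel i.
    """
    labels = sorted(set(feature_values))
    index = {value: k for k, value in enumerate(labels)}
    pooled = [[0] * len(labels) for _ in labels]
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            pooled[index[feature_values[i]]][index[feature_values[j]]] += count
    return pooled


def transmitted_information(confusion):
    """Mutual information, in bits, between stimulus and response classes."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    info = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count > 0:
                # p(i,j) * log2( p(i,j) / (p(i) * p(j)) )
                ratio = count * total / (row_sums[i] * col_sums[j])
                info += (count / total) * math.log2(ratio)
    return info


def relative_transmitted_information(confusion):
    """Transmitted information divided by the stimulus entropy (0 to 1)."""
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    entropy = -sum((r / total) * math.log2(r / total) for r in row_sums if r > 0)
    if entropy == 0:
        return 0.0
    return transmitted_information(confusion) / entropy


# Toy example: three vowels, the last two sharing one feature value.
vowel_confusion = [[8, 1, 1], [0, 5, 5], [0, 5, 5]]
feature = [0, 1, 1]
feature_confusion = pool_by_feature(vowel_confusion, feature)
score = relative_transmitted_information(feature_confusion)
```

Under this assumed definition, a score of 1 means the feature is always recovered and a score of 0 means responses carry no information about it, which allows systems to be compared feature by feature rather than only by overall recognition rate.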
Keywords :
audio-visual systems; data structures; identification; noise; sensor fusion; speech coding; speech recognition; video coding; SNR; audio sensors; audiovisual fusion models; audiovisual speech recognition; benchmark test; data representation; data streams fusion; direct identification; dominant recoding; early integration; fusion architecture; motor recoding; noise levels; noisy-vowel corpus; noisy-vowel recognition; phoneme identification; phonetic features; phonetic identification; recognition paradigms; sensor reliability; separate identification; signal-to-noise ratio; video sensors; video-speech workstation; Benchmark testing; Fuses; Noise level; Performance evaluation; Proposals; Signal to noise ratio; Streaming media; System testing
Journal_Title :
Speech and Audio Processing, IEEE Transactions on