Title :
Multipose audio-visual speech recognition
Author :
Estellers, Virginia ; Thiran, Jean-Philippe
Author_Institution :
Signal Process. Lab. LTS5, Ecole Polytech. Fed. de Lausanne (EPFL), Lausanne, Switzerland
Date :
Aug. 29 - Sept. 2, 2011
Abstract :
In this paper we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on the effects of a changing pose of the speaker relative to the camera, a problem encountered in natural situations. To that end, we introduce a pose normalization technique and perform speech recognition from multiple views by generating virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition studies and relies on linear regression to find an approximate mapping between images from different poses. Lipreading experiments quantify the performance loss caused by pose changes and the effect of the proposed pose normalization technique, while audio-visual results analyse how an audio-visual system should account for non-frontal poses in terms of the weight assigned to the visual modality in the audio-visual classifier.
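A minimal sketch of the pose-normalization idea summarized above, assuming vectorized visual features (e.g., DCT coefficients of the mouth region) and paired frontal/non-frontal training samples; the function names, feature dimensions, and least-squares formulation are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

# Illustrative sketch: learn a linear map W that regresses frontal visual
# features from non-frontal ones, then use it to generate "virtual frontal"
# features for recognition. Names and dimensions are hypothetical.

def fit_pose_mapping(X_nonfrontal, X_frontal):
    """Least-squares estimate of W such that X_frontal ~ X_nonfrontal @ W.

    X_nonfrontal, X_frontal: (n_samples, n_features) paired feature matrices,
    e.g. DCT coefficients of the mouth region from synchronized views of the
    same utterances.
    """
    W, *_ = np.linalg.lstsq(X_nonfrontal, X_frontal, rcond=None)
    return W

def normalize_pose(x_nonfrontal, W):
    """Map a non-frontal feature vector to a virtual frontal view."""
    return x_nonfrontal @ W

# Hypothetical usage: paired training features from a non-frontal view and
# the frontal view, then normalization of one test sample.
rng = np.random.default_rng(0)
X_side, X_front = rng.normal(size=(200, 64)), rng.normal(size=(200, 64))
W = fit_pose_mapping(X_side, X_front)
virtual_frontal = normalize_pose(rng.normal(size=64), W)
```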
Keywords :
audio-visual systems; face recognition; speech recognition; approximate mapping; audio-visual classifier; multipose audio-visual speech recognition; pose normalization technique; pose-invariant face recognition; visual modality; Discrete cosine transforms; Feature extraction; Mouth; Speech; Speech recognition; Visualization;
Conference_Title :
2011 19th European Signal Processing Conference (EUSIPCO)
Conference_Location :
Barcelona, Spain