Regional Information Center for Science and Technology (RICEST) - Audio-visual speech recognition incorporating facial depth information captured by the Kinect

DocumentCode :

1856247

Title :

Audio-visual speech recognition incorporating facial depth information captured by the Kinect

Author :

Galatas, Georgios ; Potamianos, Gerasimos ; Makedon, Fillia

Author_Institution :

Inst. of Inf. & Telecommun., NCSR Demokritos, Athens, Greece

fYear :

2012

fDate :

27-31 Aug. 2012

Firstpage :

2714

Lastpage :

2717

Abstract :

We investigate the use of facial depth data of a speaking subject, captured by the Kinect device, as an additional speech-informative modality to incorporate into a traditional audio-visual automatic speech recognizer. We present our feature extraction algorithm for both the visual and the accompanying depth modalities, based on a discrete cosine transform of the mouth region-of-interest data, further transformed by a two-stage linear discriminant analysis projection to incorporate speech dynamics and improve classification. For automatic speech recognition utilizing the three available data streams (audio, visual, and depth), we consider both the feature and decision fusion paradigms, the latter via a state-synchronous tri-stream hidden Markov model. We report multi-speaker recognition results on a small-vocabulary task employing our recently collected bilingual audio-visual corpus with depth information, demonstrating improved recognition performance with the addition of the proposed depth stream, across a wide range of audio conditions.
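
The two ideas named in the abstract, DCT-based features from the mouth region of interest and a weighted combination of per-stream scores, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the number of retained coefficients (`keep`), and the stream weights are assumptions made for illustration, and the two-stage linear discriminant analysis projection is omitted.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(roi):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in roi]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]

def roi_features(roi, keep=4):
    """Keep the top-left keep x keep block of low-frequency coefficients.

    `keep` is a hypothetical parameter; the record does not state how many
    coefficients the paper retains.
    """
    coeffs = dct_2d(roi)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]

def fuse_log_likelihoods(log_liks, weights):
    """Weighted sum of per-stream log-likelihoods for each HMM state.

    log_liks[s][j] is the log-likelihood of state j under stream s
    (for example audio, visual, depth). The weights are assumed values.
    """
    n_states = len(log_liks[0])
    return [sum(w * ll[j] for w, ll in zip(weights, log_liks))
            for j in range(n_states)]

if __name__ == "__main__":
    # Toy 8x8 "mouth ROI" of constant intensity, used for both the
    # visual and the depth stream in this sketch.
    roi = [[2.0] * 8 for _ in range(8)]
    print(roi_features(roi, keep=2))
    # Three streams, two states, hypothetical stream weights.
    print(fuse_log_likelihoods([[-1, -2], [-3, -4], [-5, -6]],
                               [0.5, 0.3, 0.2]))
```

In a state-synchronous multi-stream model, a weighted score of this kind replaces the single-stream emission score at each state, so all streams follow the same state sequence. Feature fusion, by contrast, would concatenate the per-stream feature vectors before modeling.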

Keywords :

audio-visual systems; discrete cosine transforms; face recognition; feature extraction; hidden Markov models; image classification; image fusion; interactive devices; speaker recognition; speech recognition; vocabulary; Kinect device; audio-visual speech recognition; bilingual audio-visual corpus; data streams; decision fusion paradigms; depth information; depth modality; discrete cosine transform; face detection; facial depth information; feature extraction algorithm; mouth region-of-interest data; small-vocabulary task; speaking subject; speech dynamics; state-synchronous tri-stream hidden Markov model; two-stage linear discriminant analysis projection; Feature extraction; Hidden Markov models; Mouth; Speech; Speech recognition; Streaming media; Visualization; Audio-visual automatic speech recognition; Microsoft Kinect; depth information; linear discriminant analysis; multi-sensory fusion;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO)

Conference_Location :

Bucharest

ISSN :

2219-5491

Print_ISBN :

978-1-4673-1068-0

Type :

conf

Filename :

6334244

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1856247