Dynamic visual features based on discriminative speech class projection for visual speech recognition

Author

Lei, Xie ; Xiu-Li, Cai ; Zhong-Hua, Fu ; Rong-Chun, Zhao

Author_Institution

Sch. of Comput. Sci., Northwestern Polytech. Univ., Xi'an, China

fYear

2004

fDate

20-22 Oct. 2004

Firstpage

687

Lastpage

690

Abstract

This paper presents a dynamic visual feature extraction scheme that captures important lip motion information for visual speech recognition. Discriminative projections based on a priori chosen speech classes (phonemes and visemes) are applied to the concatenation of pre-extracted static visual features. First- and second-order temporal derivatives are then extracted to further represent the dynamic differences. Experiments on a connected-digits task demonstrate that the proposed highly discriminative dynamic features, when appended to the static features, yield superior recognition performance. Compared with the commonly used delta and acceleration features, the proposed dynamic features lead to an 8% absolute improvement in word accuracy on the considered recognition task.
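
The pipeline outlined in the abstract (concatenate neighbouring static features, project them, then take first- and second-order temporal derivatives) might be sketched roughly as follows. This is only an illustration under several assumptions that the record does not confirm: the function names are hypothetical, the projection matrix is taken as given (in the paper it would come from a discriminant analysis trained on phoneme or viseme labels), the derivatives use a plain central difference with repeated edge frames, and the layout of the final feature vector is a guess.

```python
def stack_frames(frames, context):
    """Concatenate each frame with `context` neighbours on each side.

    Frames beyond the sequence boundary are replaced by the edge frame.
    """
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for k in range(-context, context + 1):
            window.extend(frames[min(max(t + k, 0), last)])
        stacked.append(window)
    return stacked


def project(vectors, matrix):
    """Apply a linear projection; each row of `matrix` is one direction.

    Assumption: `matrix` was trained elsewhere (e.g. by a discriminant
    analysis over speech-class labels); it is not learned here.
    """
    return [[sum(w * x for w, x in zip(row, v)) for row in matrix]
            for v in vectors]


def derivative(seq):
    """Central-difference temporal derivative with repeated edge frames."""
    last = len(seq) - 1
    return [
        [(b - a) / 2.0
         for a, b in zip(seq[max(t - 1, 0)], seq[min(t + 1, last)])]
        for t in range(len(seq))
    ]


def dynamic_features(static, matrix, context=1):
    """Return static features augmented with projected dynamic features.

    Assumed layout per frame: static + projection + 1st + 2nd derivative.
    """
    projected = project(stack_frames(static, context), matrix)
    first = derivative(projected)
    second = derivative(first)
    return [s + p + d1 + d2
            for s, p, d1, d2 in zip(static, projected, first, second)]


if __name__ == "__main__":
    # Toy one-dimensional "static feature" sequence and a hand-picked
    # projection row, both purely illustrative.
    toy_static = [[0.0], [1.0], [2.0], [3.0]]
    toy_matrix = [[-1.0, 0.0, 1.0]]
    for frame in dynamic_features(toy_static, toy_matrix, context=1):
        print(frame)
```

With a context of one frame on each side, each stacked vector is three times the static dimension, so the projection matrix must have that many columns. The number of rows sets the dimensionality of the dynamic feature before the derivatives are appended.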

Keywords

feature extraction; hidden Markov models; image sequences; speech recognition; MPEG-1 video; concatenated pre-extracted static visual features; discriminative dynamic features; discriminative speech class projection; dynamic visual feature extraction; linear discriminant analysis; lip motion information; mouth image sequences; phonemes; temporal derivatives; visemes; visual speech recognition; word accuracy; Acoustic noise; Auditory system; Automatic speech recognition; Data mining; Feature extraction; Hidden Markov models; Humans; Noise robustness; Speech processing; Speech recognition;

fLanguage

English

Publisher

IEEE

Conference_Titel

Intelligent Multimedia, Video and Speech Processing, 2004. Proceedings of 2004 International Symposium on

Print_ISBN

0-7803-8687-6

Type

conf

DOI

10.1109/ISIMP.2004.1434157

Filename

1434157