DocumentCode :
2454508
Title :
A cascade image transform for speaker independent automatic speechreading
Author :
Potamianos, G. ; Verma, A. ; Neti, C. ; Iyengar, G. ; Basu, S.
Author_Institution :
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Volume :
2
fYear :
2000
fDate :
2000
Firstpage :
1097
Abstract :
We propose a three-stage pixel based visual front end for automatic speechreading (lipreading) that results in improved recognition performance of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied to a three-dimensional video region of interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high “energy”, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis based data projection, which is applied to a concatenation of a small number of consecutive image transformed video data. The third stage is a data rotation by means of a maximum likelihood linear transform. Such a transform optimizes the likelihood of the observed data under the assumption of their class conditional Gaussian distribution with diagonal covariance. We apply the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 7-hour long, large vocabulary, continuous speech audio-visual dataset. We demonstrate significant classification accuracy gains by each added stage of the proposed algorithm, which, when combined, can reach up to 27% improvement. Overall, we achieve a 49% (38%) visual-only frame level phonetic classification accuracy with (without) use of test set phone boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end.
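Illustrative sketch (not the authors' implementation): the three-stage cascade described in the abstract can be outlined roughly as below, using NumPy only. Several choices are assumptions made for illustration: the DCT stands in for the "typical image compression transform", coefficients are kept by average energy, the context width and output dimensions are arbitrary, and the maximum likelihood linear transform is fitted with a plain gradient ascent rather than whatever optimizer the paper used. All function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m

def stage1_dct(rois, keep):
    """Stage 1 (assumed DCT): 2-D transform of each ROI frame, keeping the
    `keep` coefficients with the highest average energy over the data."""
    n_frames, h, w = rois.shape
    coeffs = (dct_matrix(h) @ rois @ dct_matrix(w).T).reshape(n_frames, h * w)
    idx = np.argsort((coeffs ** 2).mean(axis=0))[::-1][:keep]
    return coeffs[:, idx]

def stack_frames(feats, context):
    """Concatenate each frame with `context` neighbours on either side
    (sequence edges are padded by repeating the first/last frame)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    n = feats.shape[0]
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

def stage2_lda(x, labels, out_dim):
    """Stage 2: LDA projection from the leading eigenvectors of pinv(Sw) @ Sb."""
    mean, d = x.mean(axis=0), x.shape[1]
    sw, sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        sw += (xc - mc).T @ (xc - mc)                      # within-class scatter
        sb += len(xc) * np.outer(mc - mean, mc - mean)     # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.pinv(sw) @ sb)
    p = vecs[:, np.argsort(vals.real)[::-1][:out_dim]].real
    return x @ p, p

def mllt_objective(a, covs, counts):
    """Per-frame likelihood term maximized by MLLT (constants dropped):
    log|det A| - sum_c (N_c / 2N) * sum(log diag(A Sigma_c A^T))."""
    n = counts.sum()
    val = n * np.linalg.slogdet(a)[1]
    for cov, nc in zip(covs, counts):
        val -= 0.5 * nc * np.log(np.diag(a @ cov @ a.T)).sum()
    return val / n

def stage3_mllt(x, labels, iters=50, step=0.1):
    """Stage 3: fit a square matrix A by gradient ascent with step halving
    (an assumed, simple optimizer); assumes at least 2 feature dimensions."""
    classes = np.unique(labels)
    covs = [np.cov(x[labels == c].T, bias=True) for c in classes]
    counts = np.array([(labels == c).sum() for c in classes], dtype=float)
    n = counts.sum()
    a = np.eye(x.shape[1])
    best = mllt_objective(a, covs, counts)
    for _ in range(iters):
        grad = np.linalg.inv(a).T
        for cov, nc in zip(covs, counts):
            acov = a @ cov
            grad -= (nc / n) * (acov / np.diag(acov @ a.T)[:, None])
        trial = a + step * grad
        val = mllt_objective(trial, covs, counts)
        if val > best:          # accept only steps that improve the objective
            a, best = trial, val
        else:
            step *= 0.5
    return x @ a.T, a

def cascade(rois, labels, keep=6, context=1, out_dim=2):
    """Run the three stages in order on a (frames, height, width) ROI array."""
    feats = stack_frames(stage1_dct(rois, keep), context)
    projected, _ = stage2_lda(feats, labels, out_dim)
    rotated, _ = stage3_mllt(projected, labels)
    return rotated
```

Calling cascade(rois, labels) on an array of mouth-region frames with one class label per frame returns one low-dimensional feature vector per frame. In a real system the three transforms would be estimated on training data only and then applied unchanged to test data; the sketch fits and applies them on the same array for brevity.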
Keywords :
Gaussian distribution; discrete cosine transforms; discrete wavelet transforms; image classification; image recognition; speech recognition; video coding; 3D video region; audio-visual phonetic classification; cascade image transform; class conditional Gaussian distribution; data rotation; diagonal covariance; image compression transform; large vocabulary continuous speech audio-visual dataset; linear discriminant analysis based data projection; maximum likelihood linear transform; phoneme recognition performance; reduced-dimensionality video data representation; speaker independent automatic speechreading; speaker mouth area; spoken word recognition performance; three-stage pixel based visual front end; visual-only phonetic classification; visual-only visemic classification; Automatic speech recognition; Data compression; Data mining; Discrete cosine transforms; Discrete wavelet transforms; Gaussian distribution; Image coding; Linear discriminant analysis; Mouth; Video compression;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on
Conference_Location :
New York, NY
Print_ISBN :
0-7803-6536-4
Type :
conf
DOI :
10.1109/ICME.2000.871552
Filename :
871552