Regional Information Center for Science and Technology - A cascade image transform for speaker independent automatic speechreading

DocumentCode :

2454508

Title :

A cascade image transform for speaker independent automatic speechreading

Author :

Potamianos, G. ; Verma, A. ; Neti, C. ; Iyengar, G. ; Basu, S.

Author_Institution :

IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA

Volume :

fYear :

2000

fDate :

2000

Firstpage :

1097

Abstract :

We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that results in improved recognition performance of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied to a three-dimensional video region of interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high “energy”, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis based data projection, which is applied to a concatenation of a small number of consecutive image-transformed video data. The third stage is a data rotation by means of a maximum likelihood linear transform. Such a transform optimizes the likelihood of the observed data under the assumption of their class conditional Gaussian distribution with diagonal covariance. We apply the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 7-hour long, large vocabulary, continuous speech audio-visual dataset. We demonstrate significant classification accuracy gains by each added stage of the proposed algorithm, which, when combined, can reach up to 27% improvement. Overall, we achieve a 49% (38%) visual-only frame level phonetic classification accuracy with (without) use of test set phone boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end.
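
The three stages described in the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation. It assumes a per-frame 2-D DCT as the stage-one compression transform (the keywords list both cosine and wavelet transforms), edge padding when stacking consecutive frames, and plain gradient ascent with step halving to estimate the maximum likelihood linear transform. All function names and parameter values (keep, j, lda_dim, steps, lr) are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m

def stage1_dct(frames, keep):
    """Stage 1: 2-D DCT of each frame, keeping the `keep` highest-energy coefficients."""
    t, h, w = frames.shape
    coeffs = dct_matrix(h) @ frames @ dct_matrix(w).T  # matmul broadcasts over frames
    flat = coeffs.reshape(t, h * w)
    energy = (flat ** 2).mean(axis=0)
    idx = np.argsort(energy)[::-1][:keep]  # assumption: selection learned on this data
    return flat[:, idx]

def stack_frames(x, j):
    """Concatenate j (odd) consecutive feature vectors centred on each frame."""
    half = j // 2
    padded = np.pad(x, ((half, half), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(x)] for i in range(j)])

def lda_matrix(x, labels, out_dim):
    """Stage 2: LDA projection, rows are the top generalized eigenvectors of (Sb, Sw)."""
    n, d = x.shape
    mean = x.mean(axis=0)
    sw, sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        sw += (xc - mc).T @ (xc - mc) / n
        sb += len(xc) * np.outer(mc - mean, mc - mean) / n
    sw += 1e-6 * np.eye(d)  # small ridge so the Cholesky factor exists
    linv = np.linalg.inv(np.linalg.cholesky(sw))
    vals, vecs = np.linalg.eigh(linv @ sb @ linv.T)
    order = np.argsort(vals)[::-1][:out_dim]
    return (linv.T @ vecs[:, order]).T

def class_covs(y, labels):
    """Class priors and full class covariances of the projected data."""
    classes = np.unique(labels)
    weights = np.array([(labels == c).mean() for c in classes])
    covs = np.array([np.cov(y[labels == c].T, bias=True) for c in classes])
    return weights, covs

def mllt_objective(a, weights, covs):
    """Per-frame Gaussian log-likelihood (up to a constant) with diagonal class covariances."""
    diag = np.einsum("ij,cjk,ik->ci", a, covs, a)  # diag(A S_c A^T) for each class
    return np.linalg.slogdet(a)[1] - 0.5 * (weights[:, None] * np.log(diag)).sum()

def mllt_matrix(y, labels, steps=200, lr=0.1):
    """Stage 3: rotation A found by gradient ascent, halving the step when it fails."""
    weights, covs = class_covs(y, labels)
    a = np.eye(y.shape[1])
    best = mllt_objective(a, weights, covs)
    for _ in range(steps):
        diag = np.einsum("ij,cjk,ik->ci", a, covs, a)
        grad = np.linalg.inv(a).T - np.einsum("c,ci,ij,cjk->ik", weights, 1.0 / diag, a, covs)
        cand = a + lr * grad
        val = mllt_objective(cand, weights, covs)
        if val > best:
            a, best = cand, val
        else:
            lr *= 0.5
    return a

def cascade(frames, labels, keep=24, j=5, lda_dim=10):
    """Run the three stages in order; returns final features, LDA features, MLLT matrix."""
    x = stack_frames(stage1_dct(frames, keep), j)
    p = lda_matrix(x, labels, lda_dim)
    y = (x - x.mean(axis=0)) @ p.T
    a = mllt_matrix(y, labels)
    return y @ a.T, y, a
```

Calling cascade(frames, labels) on an array of shape (frames, height, width) with one class label per frame returns the final features along with the LDA output and the estimated rotation. In a real system the coefficient selection and both matrices would be estimated on training data and then applied unchanged to test data; the sketch estimates everything on a single set for brevity.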

Keywords :

Gaussian distribution; discrete cosine transforms; discrete wavelet transforms; image classification; image recognition; speech recognition; video coding; 3D video region; audio-visual phonetic classification; cascade image transform; class conditional Gaussian distribution; data rotation; diagonal covariance; image compression transform; large vocabulary continuous speech audio-visual dataset; linear discriminant analysis based data projection; maximum likelihood linear transform; phoneme recognition performance; reduced-dimensionality video data representation; speaker independent automatic speechreading; speaker mouth area; spoken word recognition performance; three-stage pixel based visual front end; visual-only phonetic classification; visual-only visemic classification; Automatic speech recognition; Data compression; Data mining; Discrete cosine transforms; Discrete wavelet transforms; Gaussian distribution; Image coding; Linear discriminant analysis; Mouth; Video compression

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on

Conference_Location :

New York, NY

Print_ISBN :

0-7803-6536-4

Type :

conf

DOI :

10.1109/ICME.2000.871552

Filename :

871552

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2454508