DocumentCode :
417671
Title :
Towards practical deployment of audio-visual speech recognition
Author :
Potamianos, G. ; Neti, C. ; Huang, J. ; Connell, J.H. ; Chu, S. ; Libal, V. ; Marcheret, E. ; Haas, N. ; Jiang, J.
Author_Institution :
IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
Volume :
3
fYear :
2004
fDate :
17-21 May 2004
Abstract :
Much progress has been achieved during the past two decades in audio-visual automatic speech recognition (AVASR). However, challenges persist that hinder AVASR deployment in practical situations, most notably, robust and fast extraction of visual speech features. We review our efforts in overcoming this problem, based on an appearance-based visual feature representation of the speaker's mouth region. We cover three topics in particular. Firstly, we discuss AVASR in realistic, visually challenging domains, where lighting, background, and head-pose vary significantly. To enhance visual-front-end robustness in such environments, we employ an improved statistical-based face detection algorithm that significantly outperforms our baseline scheme. However, visual-only recognition remains inferior to visually "clean" (studio-like) data, thus demonstrating the importance of accurate mouth region extraction. We then consider a wearable audio-visual sensor to capture the mouth region directly, thus eliminating face detection. Its use improves visual-only recognition, even over full-face videos recorded in the studio-like environment. Finally, we address the speed issue in visual feature extraction, by discussing our real-time AVASR prototype implementation. The reported progress demonstrates the feasibility of practical AVASR.
Keywords :
audio-visual systems; face recognition; feature extraction; object detection; speech recognition; audio-visual automatic speech recognition; audio-visual speech recognition; face detection; mouth region extraction; visual feature extraction; visual feature representation; visual speech feature extraction; visual-front-end robustness; wearable audio-visual sensor; Automatic speech recognition; Data mining; Face detection; Feature extraction; Mouth; Prototypes; Robustness; Speech recognition; Videos; Wearable sensors;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on
ISSN :
1520-6149
Print_ISBN :
0-7803-8484-9
Type :
conf
DOI :
10.1109/ICASSP.2004.1326660
Filename :
1326660