DocumentCode :
2981154
Title :
Audio-Visual ASR from Multiple Views inside Smart Rooms
Author :
Potamianos, Gerasimos ; Lucey, Patrick
Author_Institution :
Dept. of Human Language Technol., IBM Thomas J. Watson Res. Center, Yorktown Heights, NY
fYear :
2006
fDate :
Sept. 2006
Firstpage :
35
Lastpage :
40
Abstract :
Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face, which is not always the case in realistic human-computer interaction (HCI) scenarios. One such case of interest is HCI inside smart rooms, equipped with pan-tilt-zoom (PTZ) cameras that closely track the subject's head. Since, however, these cameras are fixed in space, they cannot necessarily obtain frontal views of the speaker. Clearly, AVASR from non-frontal views is required, as well as fusion of multiple camera views, if available. In this paper, we report our very preliminary work on this subject. In particular, we concentrate on two topics: first, the design of an AVASR system that operates on profile face views and its comparison with a traditional frontal-view AVASR system, and second, the fusion of the two systems into a multi-view frontal/profile system. We describe our visual front end approach for the profile view system, and report experiments on a multi-subject, small-vocabulary, bimodal, multi-sensory database that contains synchronously captured audio with frontal and profile face video, recorded inside the IBM smart room as part of the CHIL project. Our experiments demonstrate that AVASR is possible from profile views; however, the visual modality benefit is decreased compared to frontal video data.
Keywords :
audio-visual systems; home automation; human computer interaction; speech recognition; video cameras; HCI; IBM smart room; audio-visual ASR; audio-visual automatic speech recognition; frontal-view AVASR system; human-computer interaction; multiple views; multiview frontal-profile system; pan-tilt-zoom cameras; visual front end approach; visual information; visual modality; Automatic speech recognition; Data mining; Human computer interaction; Loudspeakers; Microphone arrays; Mouth; Robustness; Sensor arrays; Smart cameras; Speech recognition;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Multisensor Fusion and Integration for Intelligent Systems, 2006 IEEE International Conference on
Conference_Location :
Heidelberg
Print_ISBN :
1-4244-0566-1
Electronic_ISBN :
1-4244-0567-X
Type :
conf
DOI :
10.1109/MFI.2006.265643
Filename :
4042060