Regional Information Center for Science and Technology - Audio-Visual ASR from Multiple Views inside Smart Rooms

DocumentCode :

2981154

Title :

Audio-Visual ASR from Multiple Views inside Smart Rooms

Author :

Potamianos, Gerasimos ; Lucey, Patrick

Author_Institution :

Dept. of Human Language Technol., IBM Thomas J. Watson Res. Center, Yorktown Heights, NY

fYear :

2006

fDate :

Sept. 2006

Firstpage :

Lastpage :

Abstract :

Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face, which is not always the case in realistic human-computer interaction (HCI) scenarios. One such case of interest is HCI inside smart rooms, equipped with pan-tilt-zoom (PTZ) cameras that closely track the subject's head. However, since these cameras are fixed in space, they cannot necessarily obtain frontal views of the speaker. Clearly, AVASR from non-frontal views is required, as well as fusion of multiple camera views, if available. In this paper, we report our very preliminary work on this subject. In particular, we concentrate on two topics: first, the design of an AVASR system that operates on profile face views and its comparison with a traditional frontal-view AVASR system, and second, the fusion of the two systems into a multi-view frontal/profile system. We describe our visual front end approach for the profile-view system, and report experiments on a multi-subject, small-vocabulary, bimodal, multi-sensory database that contains synchronously captured audio with frontal and profile face video, recorded inside the IBM smart room as part of the CHIL project. Our experiments demonstrate that AVASR is possible from profile views; however, the benefit of the visual modality is smaller than with frontal video data.

Keywords :

audio-visual systems; home automation; human computer interaction; speech recognition; video cameras; HCI; IBM smart room; audio-visual ASR; audio-visual automatic speech recognition; frontal-view AVASR system; human-computer interaction; multiple views; multiview frontal-profile system; pan-tilt-zoom cameras; visual front end approach; visual information; visual modality; Automatic speech recognition; Data mining; Human computer interaction; Loudspeakers; Microphone arrays; Mouth; Robustness; Sensor arrays; Smart cameras; Speech recognition;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Multisensor Fusion and Integration for Intelligent Systems, 2006 IEEE International Conference on

Conference_Location :

Heidelberg

Print_ISBN :

1-4244-0566-1

Electronic_ISBN :

1-4244-0567-X

Type :

conf

DOI :

10.1109/MFI.2006.265643

Filename :

4042060

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2981154