DocumentCode :
8172
Title :
A Study of Multimodal Addressee Detection in Human-Human-Computer Interaction
Author :
Tsai, T.J. ; Stolcke, Andreas ; Slaney, Malcolm
Author_Institution :
Microsoft Res., Berkeley, CA, USA
Volume :
17
Issue :
9
fYear :
2015
fDate :
Sept. 2015
Firstpage :
1550
Lastpage :
1561
Abstract :
The goal of addressee detection is to answer the question, “Are you talking to me?” When a dialogue system interacts with multiple users, it is crucial to detect when a user is speaking to the system as opposed to another person. We study this problem in a multimodal scenario, using lexical, acoustic, visual, dialogue state, and beamforming information. Using data from a multiparty dialogue system, we quantify the benefits of using multiple modalities over using a single modality. We also assess the relative importance of the various modalities, as well as of key individual features, in estimating the addressee. We find that energy-based acoustic features are by far the most important, that information from speech recognition and system state is useful as well, and that visual and beamforming features provide little additional benefit. While we find that head pose is affected by whom the speaker is addressing, it yields little nonredundant information because the system acts as a situational attractor. Our findings are relevant to multiparty, open-world dialogue systems in which the agent plays an active, conversational role, such as an interactive assistant deployed in a public, open space. For these scenarios, our study suggests that acoustic, lexical, and system-state information is an effective and practical combination of modalities to use for addressee detection. We also consider how our analyses might be affected by the ongoing development of more realistic, natural dialogue systems.
Keywords :
array signal processing; audio user interfaces; human computer interaction; interactive systems; speaker recognition; acoustic state; active conversational role; beamforming features; beamforming information; dialogue state; energy-based acoustic features; head pose; human-human-computer interaction; interactive assistant; lexical state; multimodal addressee detection; multimodal scenario; multiparty open-world dialogue systems; multiple modalities; speech recognition; visual features; visual state; Acoustics; Computational modeling; Computers; Face; Feature extraction; Speech; Visualization; Addressee detection; beamforming; dialogue system; head pose; human-human-computer; multimodal; multiparty; prosody; speech recognition
fLanguage :
English
Journal_Title :
IEEE Transactions on Multimedia
Publisher :
IEEE
ISSN :
1520-9210
Type :
jour
DOI :
10.1109/TMM.2015.2454332
Filename :
7153545