Regional Information Center for Science and Technology - Temporal Bayesian Fusion for Affect Sensing: Combining Video, Audio, and Lexical Modalities

DocumentCode :

4455

Title :

Temporal Bayesian Fusion for Affect Sensing: Combining Video, Audio, and Lexical Modalities

Author :

Savran, Arman ; Houwei Cao ; Nenkova, Ani ; Verma, Ragini

Author_Institution :

Dept. of Radiol., Univ. of Pennsylvania, Philadelphia, PA, USA

Volume :

Issue :

fYear :

2015

fDate :

Sept. 2015

Firstpage :

1927

Lastpage :

1941

Abstract :

The affective state of people changes in the course of conversations, and these changes are expressed externally in a variety of channels, including facial expressions, voice, and spoken words. Recent advances in automatic sensing of affect through cues in individual modalities have been remarkable, yet emotion recognition is far from a solved problem. Recently, researchers have turned their attention to the problem of multimodal affect sensing in the hope that combining different information sources would provide great improvements. However, reported results fall short of expectations, indicating only modest benefits and occasionally even degradation in performance. We develop temporal Bayesian fusion for continuous real-value estimation of the valence, arousal, power, and expectancy dimensions of affect by combining video, audio, and lexical modalities. Our approach provides substantial gains in recognition performance compared to previous work. This is achieved by using a powerful temporal prediction model as the prior in Bayesian fusion and by incorporating uncertainties about the unimodal predictions. The temporal prediction model makes use of time correlations in the affect sequences and employs estimated temporal biases to control the affect estimates at the beginning of conversations. In contrast to other recent methods for combining modalities, our model is simpler, since it does not model relationships between modalities and involves only a few interpretable parameters to be estimated from the training data.

Keywords :

Bayes methods; audio signal processing; emotion recognition; image fusion; video signal processing; affect expectancy dimensions; affect sequences; arousal estimation; audio modality; continuous real-value valence estimation; facial expressions; information sources; lexical modality; multimodal affect sensing problem; power estimation; spoken words; temporal Bayesian fusion; temporal prediction model; time correlations; video modality; voice; Acoustics; Correlation; Databases; Face; Predictive models; Training; Acoustic; Bayesian fusion; affective computing; arousal; lexical; multimodal; particle filter; power; speech; temporal fusion; turn-based; valence

fLanguage :

English

Journal_Title :

Cybernetics, IEEE Transactions on

Publisher :

IEEE

ISSN :

2168-2267

Type :

jour

DOI :

10.1109/TCYB.2014.2362101

Filename :

6930787

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=4455