DocumentCode
1224282
Title
Interrelation Between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study
Author
Busso, Carlos ; Narayanan, Shrikanth S.
Author_Institution
Integrated Media Syst. Center, Univ. of Southern California, Los Angeles, CA
Volume
15
Issue
8
fYear
2007
Firstpage
2331
Lastpage
2347
Abstract
The verbal and nonverbal channels of human communication are internally and intricately connected. As a result, gestures and speech present high levels of correlation and coordination. This relationship is greatly affected by the linguistic and emotional content of the message. The present paper investigates the influence of articulation and emotions on the interrelation between facial gestures and speech. The analyses are based on an audio-visual database recorded from an actress with markers attached to her face, who was asked to read semantically neutral sentences, expressing four emotional states (neutral, sadness, happiness, and anger). A multilinear regression framework is used to estimate facial features from acoustic speech parameters. The levels of coupling between the communication channels are quantified by using Pearson's correlation between the recorded and estimated facial features. The results show that facial and acoustic features are strongly interrelated, with levels of correlation higher than r = 0.8 when the mapping is computed at the sentence level using spectral envelope speech features. The results reveal that the lower face region exhibits the highest activeness and correlation levels. Furthermore, the correlation levels present significant interemotional differences, which suggests that emotional content affects the relationship between facial gestures and speech. Principal component analysis (PCA) shows that the audiovisual mapping parameters are grouped in a smaller subspace, which suggests that there is an emotion-dependent structure that is preserved across sentences. The results suggest that this internal structure is easier to model when prosodic features are used to estimate the audiovisual mapping. The results also reveal that the correlation levels within a sentence vary according to broad phonetic properties present in the sentence. Consonants, especially unvoiced and fricative sounds, present the lowest correlation levels. Likewise, the results show that facial gestures are linked at different resolutions. While the orofacial area is locally connected with the speech, other facial gestures such as eyebrow motion are linked only at the sentence level. The results presented here have important implications for applications such as facial animation and multimodal emotion recognition.
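The abstract describes fitting a sentence-level multilinear regression from acoustic speech features to facial features and measuring coupling with Pearson's correlation. The following is a minimal sketch of that idea, not the authors' code; the feature dimensions, synthetic data, and variable names are illustrative assumptions only.

```python
# Sketch (illustrative, not the paper's implementation): fit a linear
# mapping from acoustic speech features to facial marker features for
# one sentence, then quantify audiovisual coupling with Pearson's
# correlation between recorded and estimated facial features.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-ins: T frames of acoustic features (e.g., spectral
# envelope coefficients) and facial features (e.g., marker coordinates).
T, n_acoustic, n_facial = 200, 13, 3
X = rng.standard_normal((T, n_acoustic))                     # acoustic features
W_true = rng.standard_normal((n_acoustic, n_facial))
Y = X @ W_true + 0.3 * rng.standard_normal((T, n_facial))    # facial features

# Sentence-level least-squares regression: Y ~ [X, 1] @ W
X_aug = np.hstack([X, np.ones((T, 1))])                      # add bias term
W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
Y_hat = X_aug @ W

# Coupling level: Pearson's r between recorded and estimated features,
# averaged over the facial feature dimensions.
r_values = [pearsonr(Y[:, j], Y_hat[:, j])[0] for j in range(n_facial)]
print("mean Pearson's r:", float(np.mean(r_values)))
```

In the paper's setting, this mapping would be estimated per sentence and per emotion; the PCA analysis mentioned above would then be applied to the collection of fitted mapping parameters (the matrices W) across sentences.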
Keywords
audio databases; audio-visual systems; correlation methods; emotion recognition; estimation theory; face recognition; feature extraction; principal component analysis; regression analysis; speech processing; visual databases; PCA; audio-visual database; correlation levels; emotional utterances; facial feature estimation; facial gestures; human communication; multilinear regression framework; nonverbal channels; principal component analysis; semantically neutral sentences; speech gestures; verbal channels; Audio databases; Communication channels; Emotion recognition; Eyebrows; Facial animation; Facial features; Humans; Principal component analysis; Spatial databases; Speech; Affective state; articulatory movements; facial motion; speech acoustics
fLanguage
English
Journal_Title
IEEE Transactions on Audio, Speech, and Language Processing
Publisher
IEEE
ISSN
1558-7916
Type
jour
DOI
10.1109/TASL.2007.905145
Filename
4317558
Link To Document