• DocumentCode
    1224282
  • Title
    Interrelation Between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study
  • Author
    Busso, Carlos; Narayanan, Shrikanth S.
  • Author_Institution
    Integrated Media Syst. Center, Univ. of Southern California, Los Angeles, CA
  • Volume
    15
  • Issue
    8
  • fYear
    2007
  • Firstpage
    2331
  • Lastpage
    2347
  • Abstract
    The verbal and nonverbal channels of human communication are internally and intricately connected. As a result, gestures and speech present high levels of correlation and coordination. This relationship is greatly affected by the linguistic and emotional content of the message. The present paper investigates the influence of articulation and emotions on the interrelation between facial gestures and speech. The analyses are based on an audio-visual database recorded from an actress with markers attached to her face, who was asked to read semantically neutral sentences expressing four emotional states (neutral, sadness, happiness, and anger). A multilinear regression framework is used to estimate facial features from acoustic speech parameters (an illustrative sketch of this analysis follows the record below). The levels of coupling between the communication channels are quantified using Pearson's correlation between the recorded and estimated facial features. The results show that facial and acoustic features are strongly interrelated, with correlation levels higher than r = 0.8 when the mapping is computed at the sentence level using spectral envelope speech features. The lower face region presents the highest activeness and correlation levels. Furthermore, the correlation levels show significant inter-emotional differences, which suggests that emotional content affects the relationship between facial gestures and speech. Principal component analysis (PCA) shows that the audiovisual mapping parameters are grouped in a smaller subspace, which suggests that there is an emotion-dependent structure that is preserved across sentences. This internal structure appears easier to model when prosodic features are used to estimate the audiovisual mapping. The results also reveal that the correlation levels within a sentence vary according to the broad phonetic properties present in the sentence; consonants, especially unvoiced and fricative sounds, present the lowest correlation levels. Likewise, facial gestures are linked to speech at different resolutions: while the orofacial area is locally coupled with speech, other facial gestures, such as eyebrow motion, are linked only at the sentence level. The results presented here have important implications for applications such as facial animation and multimodal emotion recognition.
  • Keywords
    audio databases; audio-visual systems; correlation methods; emotion recognition; estimation theory; face recognition; feature extraction; principal component analysis; regression analysis; speech processing; visual databases; PCA; audio-visual database; correlation levels; emotional utterances; facial feature estimation; facial gestures; human communication; multilinear regression framework; nonverbal channels; principal component analysis; semantically neutral sentences; speech gestures; verbal channels; Audio databases; Communication channels; Emotion recognition; Eyebrows; Facial animation; Facial features; Humans; Principal component analysis; Spatial databases; Speech; Affective state; articulatory movements; facial motion; speech acoustic
  • fLanguage
    English
  • Journal_Title
    IEEE Transactions on Audio, Speech, and Language Processing
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type
    jour
  • DOI
    10.1109/TASL.2007.905145
  • Filename
    4317558
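
The abstract describes estimating facial features from acoustic features with a multilinear regression mapping and measuring audiovisual coupling via Pearson's correlation. The snippet below is a minimal sketch of that kind of sentence-level analysis, not the paper's actual pipeline: the feature dimensions, the synthetic acoustic/facial data, and the variable names (A, F, W) are illustrative assumptions.

    # Minimal sketch: least-squares (multilinear regression) mapping from
    # acoustic features to facial features, then Pearson's correlation
    # between recorded and estimated facial features. Data are synthetic
    # stand-ins for the paper's motion-capture and speech features.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)

    # Assumed per-frame features for one sentence: acoustic (e.g., spectral
    # envelope coefficients) and facial marker features.
    n_frames, n_acoustic, n_facial = 200, 13, 10
    A = rng.normal(size=(n_frames, n_acoustic))                    # acoustic features
    true_map = rng.normal(size=(n_acoustic, n_facial))
    F = A @ true_map + 0.3 * rng.normal(size=(n_frames, n_facial)) # facial features

    # Sentence-level mapping: least-squares estimate of facial features
    # from acoustic features (with a bias term).
    A1 = np.hstack([A, np.ones((n_frames, 1))])
    W, *_ = np.linalg.lstsq(A1, F, rcond=None)
    F_hat = A1 @ W

    # Coupling level: Pearson's correlation between recorded and estimated
    # facial features, averaged over facial dimensions.
    r_per_feature = [pearsonr(F[:, j], F_hat[:, j])[0] for j in range(n_facial)]
    print(f"mean correlation r = {np.mean(r_per_feature):.2f}")

In the study, such correlations would be computed per facial region and per emotion (and the mapping parameters further analyzed with PCA); this sketch only shows the core mapping-and-correlation step.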