• DocumentCode
    1411064
  • Title
    Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels
  • Author
    Wu, Chung-Hsien; Liang, Wei-Bin
  • Author_Institution
    Dept. of Comput. Sci. & Inf. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan
  • Volume
    2
  • Issue
    1
  • fYear
    2011
  • Firstpage
    10
  • Lastpage
    21
  • Abstract
    This work presents an approach to emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic (AP) information and semantic labels (SLs). For AP-based recognition, acoustic and prosodic features, including spectrum, formant, and pitch-related features, are extracted from the detected emotionally salient segments of the input speech. Three types of models, GMMs, SVMs, and MLPs, are adopted as the base-level classifiers. A Meta Decision Tree (MDT) is then employed for classifier fusion to obtain the AP-based emotion recognition confidence. For SL-based recognition, semantic labels derived from an existing Chinese knowledge base called HowNet are used to automatically extract Emotion Association Rules (EARs) from the recognized word sequence of the affective speech. A maximum entropy (MaxEnt) model is then used to characterize the relationship between emotional states and EARs for emotion recognition. Finally, a weighted product fusion method integrates the AP-based and SL-based recognition results for the final emotion decision. For evaluation, 2,033 utterances covering four emotional states (Neutral, Happy, Angry, and Sad) were collected. The speaker-independent experimental results show that MDT-based emotion recognition achieves an accuracy of 80.00 percent, which is better than that of each individual classifier. SL-based recognition achieves an average recognition accuracy of 80.92 percent. Combining acoustic-prosodic information and semantic labels achieves 83.55 percent, which is superior to either the AP-based or the SL-based approach alone. Moreover, when individual personality traits are considered for personalized applications, the recognition accuracy of the proposed approach improves further to 85.79 percent.
  • Keywords
    data mining; decision trees; emotion recognition; feature extraction; pattern classification; speech recognition; Chinese knowledge base; EAR extraction; HowNet; acoustic prosodic information; classifier fusion; emotion association rule; formant feature; meta decision tree; multiple classifier based affective speech recognition; personality trait; pitch related feature extraction; semantic label; spectrum feature; Probability distribution; Semantics; acoustic-prosodic features; meta decision trees; semantic labels
  • fLanguage
    English
  • Journal_Title
    Affective Computing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1949-3045
  • Type
    jour
  • DOI
    10.1109/T-AFFC.2010.16
  • Filename
    5674019
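
The final step described in the abstract, a weighted product fusion of the AP-based and SL-based results, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the form used here (each confidence raised to a weight, multiplied, then renormalized) is an assumption, as are the function name `weighted_product_fusion`, the parameter `ap_weight`, and the example confidence values.

```python
import math

# The four emotional states named in the abstract.
EMOTIONS = ("Neutral", "Happy", "Angry", "Sad")


def weighted_product_fusion(ap_conf, sl_conf, ap_weight=0.5, eps=1e-12):
    """Fuse two per-emotion confidence dicts with a weighted product.

    Assumed form (not taken from the paper):
        fused(e) is proportional to ap_conf[e]**w * sl_conf[e]**(1 - w)
    where w = ap_weight. Returns (best_emotion, normalized_scores).
    """
    if not 0.0 <= ap_weight <= 1.0:
        raise ValueError("ap_weight must lie in [0, 1]")

    # Work in the log domain so very small confidences do not underflow.
    log_scores = {
        e: ap_weight * math.log(max(ap_conf[e], eps))
        + (1.0 - ap_weight) * math.log(max(sl_conf[e], eps))
        for e in EMOTIONS
    }

    # Subtract the peak before exponentiating, then renormalize to sum to 1.
    peak = max(log_scores.values())
    unnormalized = {e: math.exp(s - peak) for e, s in log_scores.items()}
    total = sum(unnormalized.values())
    fused = {e: v / total for e, v in unnormalized.items()}

    best = max(fused, key=fused.get)
    return best, fused


# Hypothetical confidences, for illustration only.
ap_example = {"Neutral": 0.1, "Happy": 0.6, "Angry": 0.2, "Sad": 0.1}
sl_example = {"Neutral": 0.2, "Happy": 0.3, "Angry": 0.4, "Sad": 0.1}
decision, scores = weighted_product_fusion(ap_example, sl_example, ap_weight=0.5)
```

With equal weights the two sources contribute symmetrically; moving `ap_weight` toward 0 or 1 lets one source dominate, which is one plausible way such a weight could be tuned on held-out data.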