Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan
Abstract :
This work presents an approach to emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information (AP) and semantic labels (SLs). For AP-based recognition, acoustic and prosodic features including spectrum, formant, and pitch-related features are extracted from the detected emotional salient segments of the input speech. Three types of models, GMMs, SVMs, and MLPs, are adopted as the base-level classifiers. A Meta Decision Tree (MDT) is then employed for classifier fusion to obtain the AP-based emotion recognition confidence. For SL-based recognition, semantic labels derived from an existing Chinese knowledge base called HowNet are used to automatically extract Emotion Association Rules (EARs) from the recognized word sequence of the affective speech. The maximum entropy model (MaxEnt) is thereafter utilized to characterize the relationship between emotional states and EARs for emotion recognition. Finally, a weighted product fusion method is used to integrate the AP-based and SL-based recognition results for the final emotion decision. For evaluation, 2,033 utterances for four emotional states (Neutral, Happy, Angry, and Sad) are collected. The speaker-independent experimental results reveal that the emotion recognition performance based on MDT can achieve 80.00 percent, which is better than each individual classifier. On the other hand, an average recognition accuracy of 80.92 percent can be obtained for SL-based recognition. Finally, combining acoustic-prosodic information and semantic labels can achieve 83.55 percent, which is superior to either AP-based or SL-Based approaches. Moreover, considering the individual personality trait for personalized application, the recognition accuracy of the proposed approach can be further improved to 85.79 percent.
Keywords :
data mining; decision trees; emotion recognition; feature extraction; pattern classification; speech recognition; Chinese knowledge base; EAR extraction; HowNet; acoustic prosodic information; classifier fusion; emotion association rule; emotion recognition; formant feature; meta decision tree; multiple classifier based affective speech recognition; personality trait; pitch related feature extraction; semantic label; spectrum feature; Decision trees; Emotion recognition; Feature extraction; Probability distribution; Semantics; Speech recognition; Emotion recognition; acoustic-prosodic features; meta decision trees; personality trait.; semantic labels;