Regional Information Center for Science and Technology - Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks

DocumentCode :

1755723

Title :

Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks

Author :

Qirong Mao ; Ming Dong ; Zhengwei Huang ; Yongzhao Zhan

Author_Institution :

Dept. of Comput. Sci. & Commun. Eng, Jiangsu Univ., Zhenjiang, China

Volume :

Issue :

fYear :

2014

fDate :

Dec. 2014

Firstpage :

2203

Lastpage :

2213

Abstract :

As an essential way of understanding human emotional behavior, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER depends heavily on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). The training of the CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant of the sparse auto-encoder (SAE) with reconstruction penalization. In the second stage, the LIF are used as the input to a feature extractor, salient discriminative feature analysis (SDFA), to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination for SER. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and language variation, and environment distortion) and outperforms several well-established SER features.
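The first training stage mentioned in the abstract relies on a sparse auto-encoder. The abstract does not specify the authors' variant or its reconstruction penalization, so the following is only a minimal sketch of a generic sparse auto-encoder objective (squared reconstruction error plus a KL-divergence sparsity penalty). The function name, the parameter names, and the default values of rho and beta are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def sparse_autoencoder_loss(X, W_enc, b_enc, W_dec, b_dec, rho=0.05, beta=3.0):
    """Generic sparse auto-encoder objective (NOT the paper's exact variant).

    X      : (n_samples, n_inputs) unlabeled input patches
    W_enc  : (n_inputs, n_hidden) encoder weights, b_enc: (n_hidden,)
    W_dec  : (n_hidden, n_inputs) decoder weights, b_dec: (n_inputs,)
    rho    : target mean activation of each hidden unit (assumed value)
    beta   : weight of the sparsity penalty (assumed value)
    Returns (total loss, reconstruction term, sparsity term).
    """
    H = sigmoid(X @ W_enc + b_enc)        # hidden activations
    X_hat = sigmoid(H @ W_dec + b_dec)    # reconstruction of the input
    # Mean over samples of half the squared reconstruction error.
    recon = 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1))
    # Average activation per hidden unit, clipped to keep the logs finite.
    rho_hat = np.clip(H.mean(axis=0), 1e-8, 1.0 - 1e-8)
    # KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), summed over units.
    kl = np.sum(
        rho * np.log(rho / rho_hat)
        + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    )
    return recon + beta * kl, recon, kl
```

In such a setup, the loss would be minimized over the encoder and decoder parameters on unlabeled data, and the learned encoder would then supply features to the second, supervised stage.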

Keywords :

convolution; emotion recognition; feature extraction; neural nets; signal reconstruction; speech recognition; CNN; LIF; SAE; SDFA; SER; affect-salient feature; complex scenes; convolutional neural networks; feature extractor; feature saliency; human emotional behavior understanding; human-centered signal processing; local invariant feature; objective function; orthogonality; reconstruction penalization; robust recognition performance; salient discriminative feature analysis; salient features; sparse auto-encoder; speech emotion recognition; Acoustics; Convolution; Emotion recognition; Feature extraction; Spectrogram; Speech; Speech recognition; Affective-salient discriminative feature analysis; convolutional neural networks; feature learning; speech emotion recognition;

fLanguage :

English

Journal_Title :

Multimedia, IEEE Transactions on

Publisher :

IEEE

ISSN :

1520-9210

Type :

jour

DOI :

10.1109/TMM.2014.2360798

Filename :

6913013

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1755723