DocumentCode :
1510667
Title :
Emotional Audio-Visual Speech Synthesis Based on PAD
Author :
Jia, Jia ; Zhang, Shen ; Meng, Fanbo ; Wang, Yongxin ; Cai, Lianhong
Author_Institution :
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
Volume :
19
Issue :
3
fYear :
2011
fDate :
3/1/2011 12:00:00 AM
Firstpage :
570
Lastpage :
582
Abstract :
Audio-visual speech synthesis is the core function for realizing face-to-face human-computer communication. While considerable effort has been made to enable computers to talk like people, how to integrate emotional expressions into audio-visual speech synthesis remains largely an open problem. In this paper, we adopt the Pleasure-Displeasure, Arousal-Nonarousal, and Dominance-Submissiveness (PAD) 3-D emotional space, in which emotions can be described and quantified along three different dimensions. Based on this definition, we propose a unified model for emotional speech conversion using a Boosting-Gaussian mixture model (GMM), as well as a facial expression synthesis model. We further present an emotional audio-visual speech synthesis approach. Specifically, we take the text and the target PAD values as input, and employ a text-to-speech (TTS) engine to first generate neutral speech. The Boosting-GMM then converts the neutral speech to emotional speech, and the facial expression is synthesized simultaneously. Finally, the acoustic features of the emotional speech are used to modulate the facial expression in the audio-visual speech. We designed three objective and five subjective experiments to evaluate the performance of each model and of the overall approach. Our experimental results on audio-visual emotional speech datasets show that the proposed approach can effectively and efficiently synthesize natural and expressive emotional audio-visual speech. Analysis of the results also reveals that a mutually reinforcing relationship indeed exists between the audio and video information.
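The conversion step above maps neutral acoustic features to emotional ones with a joint statistical model. As a minimal sketch of that idea (not the paper's Boosting-GMM, which uses a boosted mixture of many components), the snippet below fits a single joint Gaussian over paired neutral/emotional feature vectors and converts a new vector via the conditional mean E[y | x]; all variable names and the toy data are illustrative assumptions.

```python
import numpy as np

def fit_joint_gaussian(X, Y):
    """Fit one joint Gaussian over stacked (neutral, emotional) feature pairs.

    Returns mu_x, mu_y, and the regression matrix A = Sigma_yx @ inv(Sigma_xx),
    which are all that is needed for conditional-mean conversion.
    """
    Z = np.hstack([X, Y])                 # joint vectors [x; y]
    mu = Z.mean(axis=0)
    d = X.shape[1]
    cov = np.cov(Z, rowvar=False)
    Sxx = cov[:d, :d]
    Syx = cov[d:, :d]
    A = Syx @ np.linalg.inv(Sxx + 1e-8 * np.eye(d))  # tiny ridge for stability
    return mu[:d], mu[d:], A

def convert(x, mu_x, mu_y, A):
    """Map a neutral feature vector x to its emotional counterpart
    via the conditional mean E[y | x] of the joint Gaussian."""
    return mu_y + A @ (x - mu_x)

# Toy demo: "emotional" features related to "neutral" ones by a known
# linear map plus an offset, so the learned mapping can be checked.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))             # hypothetical neutral features
W = np.array([[1.2, 0.0, 0.3],
              [0.0, 0.8, 0.0],
              [0.1, 0.0, 1.5]])
Y = X @ W.T + np.array([0.5, -0.2, 1.0])  # hypothetical emotional features
mu_x, mu_y, A = fit_joint_gaussian(X, Y)
x_new = np.array([1.0, -1.0, 0.5])
y_hat = convert(x_new, mu_x, mu_y, A)
```

A full GMM version replaces the single regression matrix with posterior-weighted per-component transforms, which is what lets the mapping vary across regions of the acoustic space.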
Keywords :
audio-visual systems; emotion recognition; human computer interaction; speech synthesis; arousal-nonarousal emotional space; boosting-GMM; boosting-Gaussian mixture model; dominance-submissiveness 3D emotional space; emotional audio-visual speech synthesis; emotional expression; face-to-face human computer communication; facial expression; facial expression synthesis model; neutral speech; pleasure-displeasure; target PAD values; text-to-speech engine; facial animation; speech analysis; Audio-visual speech; Pleasure–Displeasure, Arousal–Nonarousal, and Dominance–Submissiveness (PAD); boosting-Gaussian mixture model (GMM); emotion; facial expression;
fLanguage :
English
Journal_Title :
IEEE Transactions on Audio, Speech, and Language Processing
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TASL.2010.2052246
Filename :
5482024