DocumentCode :
1510667
Title :
Emotional Audio-Visual Speech Synthesis Based on PAD
Author :
Jia, Jia ; Zhang, Shen ; Meng, Fanbo ; Wang, Yongxin ; Cai, Lianhong
Author_Institution :
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
Volume :
19
Issue :
3
fYear :
2011
fDate :
3/1/2011 12:00:00 AM
Firstpage :
570
Lastpage :
582
Abstract :
Audio-visual speech synthesis is the core function for realizing face-to-face human-computer communication. While considerable effort has been made to enable computers to talk like people, how to integrate emotional expressions into audio-visual speech synthesis remains largely an open problem. In this paper, we adopt the Pleasure-Displeasure, Arousal-Nonarousal, and Dominance-Submissiveness (PAD) 3-D emotional space, in which emotions can be described and quantified along three different dimensions. Based on this definition, we propose a unified model for emotional speech conversion using a Boosting-Gaussian mixture model (GMM), as well as a facial expression synthesis model. We further present an emotional audio-visual speech synthesis approach. Specifically, we take the text and the target PAD values as input, and employ a text-to-speech (TTS) engine to first generate neutral speech. The Boosting-GMM then converts the neutral speech to emotional speech, and the facial expression is synthesized simultaneously. Finally, the acoustic features of the emotional speech are used to modulate the facial expression in the audio-visual speech. We designed three objective and five subjective experiments to evaluate the performance of each model and of the overall approach. Our experimental results on audio-visual emotional speech datasets show that the proposed approach can effectively and efficiently synthesize natural and expressive emotional audio-visual speech. Analysis of the results also reveals that a mutually reinforcing relationship indeed exists between the audio and video information.
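The conversion step above maps neutral acoustic features to emotional ones with a joint statistical model. As a minimal sketch of that idea (not the paper's Boosting-GMM, which uses a boosted mixture of many components), the snippet below fits a single joint Gaussian over paired neutral/emotional feature vectors and converts a new vector via the conditional mean E[y | x]; all variable names and the toy data are illustrative assumptions.

```python
import numpy as np

def fit_joint_gaussian(X, Y):
    """Fit one joint Gaussian over stacked (neutral, emotional) feature pairs.

    Returns mu_x, mu_y, and the regression matrix A = Sigma_yx @ inv(Sigma_xx),
    which are all that is needed for conditional-mean conversion.
    """
    Z = np.hstack([X, Y])                 # joint vectors [x; y]
    mu = Z.mean(axis=0)
    d = X.shape[1]
    cov = np.cov(Z, rowvar=False)
    Sxx = cov[:d, :d]
    Syx = cov[d:, :d]
    A = Syx @ np.linalg.inv(Sxx + 1e-8 * np.eye(d))  # tiny ridge for stability
    return mu[:d], mu[d:], A

def convert(x, mu_x, mu_y, A):
    """Map a neutral feature vector x to its emotional counterpart
    via the conditional mean E[y | x] of the joint Gaussian."""
    return mu_y + A @ (x - mu_x)

# Toy demo: "emotional" features related to "neutral" ones by a known
# linear map plus an offset, so the learned mapping can be checked.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))             # hypothetical neutral features
W = np.array([[1.2, 0.0, 0.3],
              [0.0, 0.8, 0.0],
              [0.1, 0.0, 1.5]])
Y = X @ W.T + np.array([0.5, -0.2, 1.0])  # hypothetical emotional features
mu_x, mu_y, A = fit_joint_gaussian(X, Y)
x_new = np.array([1.0, -1.0, 0.5])
y_hat = convert(x_new, mu_x, mu_y, A)
```

A full GMM version replaces the single regression matrix with posterior-weighted per-component transforms, which is what lets the mapping vary across regions of the acoustic space.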
Keywords :
audio-visual systems; emotion recognition; human computer interaction; speech synthesis; arousal-nonarousal emotional space; boosting-GMM; boosting-Gaussian mixture model; dominance-submissiveness 3D emotional space; emotional audio-visual speech synthesis; emotional expression; face-to-face human computer communication; facial expression; facial expression synthesis model; neutral speech; pleasure-displeasure; target PAD values; text-to-speech engine; facial animation; speech analysis; Audio-visual speech; Pleasure–Displeasure, Arousal–Nonarousal, and Dominance–Submissiveness (PAD); boosting-Gaussian mixture model (GMM); emotion; facial expression;
fLanguage :
English
Journal_Title :
IEEE Transactions on Audio, Speech, and Language Processing
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TASL.2010.2052246
Filename :
5482024