DocumentCode :
107105
Title :
Speaker and Expression Factorization for Audiobook Data: Expressiveness and Transplantation
Author :
Chen, Langzhou ; Braunschweiler, Norbert ; Gales, Mark J. F.
Author_Institution :
Cambridge Res. Lab., Toshiba Res. Eur. Ltd., Cambridge, UK
Volume :
23
Issue :
4
fYear :
2015
fDate :
April 2015
Firstpage :
605
Lastpage :
618
Abstract :
Expressive speech synthesis from text is a challenging problem, for two reasons. First, read text is often highly expressive, conveying the emotions and scenarios in the text. Second, since expressive training speech is not always available for every speaker, methods are needed to share expressive information across speakers. This paper investigates using very expressive, highly diverse audiobook data from multiple speakers to build an expressive speech synthesis system. Both problems are addressed with a factorized framework in which speaker and emotion are modelled in separate sub-spaces of a cluster adaptive training (CAT) parametric speech synthesis system. The sub-spaces for a speaker's expressive state and for the speaker's characteristics are jointly trained on a set of audiobooks. The resulting system operates in two distinct modes. In the first mode, the expressive information is given by audio data, and adaptation is used to extract that information from the audio. In the second mode, the input is plain text, and a full expressive synthesis system is examined in which the expressive state is predicted from the text. In both modes, the expressive information is shared and transplanted across different speakers. Experimental results show that in both modes the proposed method significantly improves the expressiveness of the synthetic speech for different speakers. Finally, the paper examines whether the expressive states can be predicted from text for multiple speakers with a single model, or whether the prediction process must be speaker specific.
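The factorization described above can be illustrated with a minimal sketch. In CAT-style models, a state's mean vector is a linear combination of cluster mean vectors; in the factorized variant the combination splits into a speaker sub-space and an expression sub-space, so an expression weight vector learned from one speaker can be reused ("transplanted") with another speaker's weights. The function and variable names below are illustrative, not the paper's actual implementation:

```python
import numpy as np

def factorized_cat_mean(bias_mean, speaker_clusters, expr_clusters,
                        speaker_weights, expr_weights):
    """Sketch of a factorized CAT mean: a bias cluster plus weighted
    sums over speaker-cluster and expression-cluster mean vectors.
    Shapes: bias_mean (dim,), speaker_clusters (n_spk, dim),
    expr_clusters (n_expr, dim)."""
    mean = bias_mean.copy()
    mean += speaker_clusters.T @ speaker_weights  # speaker sub-space
    mean += expr_clusters.T @ expr_weights        # expression sub-space
    return mean

rng = np.random.default_rng(0)
dim, n_spk, n_expr = 4, 3, 2
bias = rng.normal(size=dim)
M_spk = rng.normal(size=(n_spk, dim))    # speaker cluster means
M_expr = rng.normal(size=(n_expr, dim))  # expression cluster means

lam_spk_a = np.array([1.0, 0.0, 0.0])    # speaker A weights
lam_spk_b = np.array([0.0, 1.0, 0.0])    # speaker B weights
lam_expr = np.array([0.7, 0.3])          # shared expressive state

# Transplantation: the same expression weights rendered in two voices.
mean_a = factorized_cat_mean(bias, M_spk, M_expr, lam_spk_a, lam_expr)
mean_b = factorized_cat_mean(bias, M_spk, M_expr, lam_spk_b, lam_expr)
```

Because the two sub-spaces contribute additively, swapping only the speaker weights changes the voice while leaving the expression contribution identical, which is the mechanism that lets expressiveness transfer across speakers.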
Keywords :
matrix decomposition; speaker recognition; speech synthesis; audiobook data; cluster adaptive training parametric speech synthesis system; expression factorization; expressive speech synthesis system; factorized framework; multiple speakers; speaker factorization; Acoustics; Equations; Speech; Training; Training data; Transforms; Vectors; Audiobook; cluster adaptive training; expressive speech synthesis; factorization; hidden Markov model; neural network;
fLanguage :
English
Journal_Title :
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publisher :
IEEE
ISSN :
2329-9290
Type :
jour
DOI :
10.1109/TASLP.2014.2385478
Filename :
6995936