Regional Information Center for Science and Technology - Unsupervised clustering of emotion and voice styles for expressive TTS

DocumentCode :

3161475

Title :

Unsupervised clustering of emotion and voice styles for expressive TTS

Author :

Eyben, Florian ; Buchholz, S. ; Braunschweiler, Norbert ; Latorre, Javier ; Wan, Vincent ; Gales, Mark J.F. ; Knill, Kate

Author_Institution :

Cambridge Res. Lab., Toshiba Res. Eur. Ltd., Cambridge, UK

Year :

2012

Date :

25-30 March 2012

Firstpage :

4009

Lastpage :

4012

Abstract :

Current text-to-speech synthesis (TTS) systems are often perceived as lacking expressiveness, limiting the ability to fully convey information. This paper describes initial investigations into improving expressiveness for statistical speech synthesis systems. Rather than using hand-crafted definitions of expressive classes, an unsupervised clustering approach is described which is scalable to large quantities of training data. To incorporate this “expression cluster” information into an HMM-TTS system two approaches are described: cluster questions in the decision tree construction; and average expression speech synthesis (AESS) using cluster-based linear transform adaptation. The performance of the approaches was evaluated on audiobook data in which the reader exhibits a wide range of expressiveness. A subjective listening test showed that synthesising with AESS results in speech that better reflects the expressiveness of human speech than a baseline expression-independent system.

Keywords :

decision trees; hidden Markov models; pattern clustering; speech synthesis; statistical analysis; transforms; AESS; HMM-TTS system; audiobook data; average expression speech synthesis; baseline expression-independent system; cluster-based linear transform adaptation; decision tree construction; emotion unsupervised clustering; expression cluster information; expressive TTS; expressive class hand-crafted definitions; expressive text-to-speech synthesis; human speech expressiveness; statistical speech synthesis systems; subjective listening test; training data quantity; unsupervised clustering approach; voice styles; Context; Decision trees; Hidden Markov models; IEEE Aerospace and Electronic Systems Society; Speech; Speech synthesis; Training; Average Voice Model; Expressive synthesis; HMM-TTS; text-to-speech; unsupervised clustering

Language :

English

Publisher :

IEEE

Conference_Title :

Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on

Conference_Location :

Kyoto

ISSN :

1520-6149

Print_ISBN :

978-1-4673-0045-2

Electronic_ISBN :

1520-6149

Type :

conf

DOI :

10.1109/ICASSP.2012.6288797

Filename :

6288797

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3161475