Cluster adaptive training of average voice models

Author

Wan, Vincent ; Latorre, Javier ; Yanagisawa, Kei ; Gales, Mark ; Stylianou, Yannis

Author_Institution

Toshiba Res. Eur. Ltd., Cambridge, UK

fYear

2014

fDate

4-9 May 2014

Firstpage

280

Lastpage

284

Abstract

Hidden Markov model based text-to-speech systems may be adapted so that the synthesised speech sounds like a particular person. The average voice model (AVM) approach uses linear transforms to achieve this, while multiple decision tree cluster adaptive training (CAT) represents different speakers as points in a low-dimensional space. This paper describes a novel combination of CAT and AVM for modelling speakers. CAT yields higher-quality synthetic speech than AVMs, but AVMs model the target speaker better. The resulting combination may be interpreted as a more powerful version of the AVM. Results show that the combination achieves better target speaker similarity than either AVM or CAT alone, while its speech quality lies between that of AVM and CAT.
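
The two adaptation ideas named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook forms: CAT builds a speaker's mean vector as a weighted sum of cluster means, and AVM-style adaptation applies an affine transform to a mean. Composing the two by transforming the CAT mean is only one plausible reading of "combination" and is not confirmed by the abstract. All names (`cat_mean`, `linear_transform`) and all numbers are hypothetical.

```python
from typing import List

Vector = List[float]


def cat_mean(cluster_means: List[Vector], weights: Vector) -> Vector:
    """Interpolate cluster mean vectors using speaker-specific weights."""
    if len(cluster_means) != len(weights):
        raise ValueError("need exactly one weight per cluster")
    dim = len(cluster_means[0])
    return [
        sum(w * m[d] for w, m in zip(weights, cluster_means))
        for d in range(dim)
    ]


def linear_transform(mean: Vector, A: List[Vector], b: Vector) -> Vector:
    """Apply an affine transform (matrix A plus bias b) to a mean vector."""
    return [
        sum(a * x for a, x in zip(row, mean)) + bias
        for row, bias in zip(A, b)
    ]


# Hypothetical 2-dimensional example: one bias cluster plus two more clusters.
cluster_means = [[1.0, 2.0], [0.5, -0.5], [-1.0, 1.0]]
# In CAT, a speaker is a point in this low-dimensional weight space.
speaker_weights = [1.0, 0.5, 0.25]

# Hypothetical speaker transform of the kind used in AVM adaptation.
A = [[1.0, 0.0], [0.0, 2.0]]
b = [0.5, -1.0]

cat_only = cat_mean(cluster_means, speaker_weights)
# Assumed composition: transform the CAT-interpolated mean.
combined = linear_transform(cat_only, A, b)

print(cat_only)   # [1.0, 2.0]
print(combined)   # [1.5, 3.0]
```

In this sketch the CAT weights have only as many free parameters as there are clusters, whereas the transform has a full matrix and bias per speaker, which is consistent with the abstract's remark that the combination can be viewed as a more powerful AVM.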

Keywords

decision trees; maximum likelihood estimation; pattern clustering; speech synthesis; transforms; AVM approach; CAT; average voice model approach; hidden Markov model based text-to-speech systems; linear transforms; low dimensional space; multiple decision tree cluster adaptive training; speaker modelling; speech quality; synthesised speech; target speaker similarity; Adaptation models; Decision trees; Hidden Markov models; Speech; Training; Transforms; Vectors; Speech synthesis; average voice model; cluster adaptive training; voice cloning

fLanguage

English

Publisher

IEEE

Conference_Title

Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on

Conference_Location

Florence

Type

conf

DOI

10.1109/ICASSP.2014.6853602

Filename

6853602