Building HMM-TTS Voices on Diverse Data

Author

Wan, Vincent ; Latorre, Javier ; Yanagisawa, Kei ; Braunschweiler, Norbert ; Chen, Langzhou ; Gales, Mark J.F. ; Akamine, Masami

Author_Institution

Speech Technology Group, Toshiba Research Europe Ltd., Cambridge, UK

Volume

8

Issue

2

fYear

2014

fDate

April 2014

Firstpage

296

Lastpage

306

Abstract

The statistical models of hidden Markov model-based text-to-speech (HMM-TTS) systems are typically built using homogeneous data. Data can be acquired from many different sources, but combining them yields a non-homogeneous, or diverse, dataset. This paper describes the application of average voice models (AVMs) and a novel application of cluster adaptive training (CAT) with multiple context-dependent decision trees to create HMM-TTS voices from diverse data: speech recorded in studios mixed with speech obtained from the internet. Training AVM and CAT models on diverse data yields better-quality speech than training on high-quality studio data alone. Tests show that CAT can create a voice for a target speaker from as little as 7 seconds of adaptation data, whereas an AVM needs more data to reach the same level of similarity to the target speaker. Tests also show that CAT produces higher-quality voices than AVMs irrespective of the amount of adaptation data. Lastly, modeling the data with multiple context-clustering decision trees is shown to be beneficial.

Keywords

data acquisition; decision trees; hidden Markov models; learning (artificial intelligence); pattern clustering; speech synthesis; statistical analysis; AVM; CAT; HMM-TTS system; Internet; average voice model; cluster adaptive training; hidden Markov model based text-to-speech system; homogeneous diverse dataset; multiple context clustering decision tree; multiple context dependent decision tree; nonhomogeneous diverse dataset; speech data recording; speech quality; statistical model; target speaker; Adaptation models; Data models; Decision trees; Hidden Markov models; Speech; Training; Vectors; Average voice models; speaker adaptation

fLanguage

English

Journal_Title

IEEE Journal of Selected Topics in Signal Processing

Publisher

IEEE

ISSN

1932-4553

Type

jour

DOI

10.1109/JSTSP.2013.2295058

Filename

6687250