Title :
Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification?
Author :
Latorre, Javier ; Gales, Mark J F ; Buchholz, Sabine ; Knill, Kate ; Tamura, Masatsune ; Ohtani, Yamato ; Akamine, Masami
Author_Institution :
Cambridge Res. Lab., Toshiba Res. Eur. Ltd., Cambridge, UK
Abstract :
Most HMM-based TTS systems use a hard voiced/unvoiced classification to produce a discontinuous F0 signal which is used for the generation of the source-excitation. When a mixed source excitation is used, this decision can be based on two different sources of information: the state-specific MSD-prior of the F0 models, and/or the frame-specific features generated by the aperiodicity model. This paper examines the meaning of these variables in the synthesis process, their interaction, and how they affect the perceived quality of the generated speech. The results of several perceptual experiments show that when using mixed excitation, subjects consistently prefer samples with very few or no false unvoiced errors, whereas a reduction in the rate of false voiced errors does not produce any perceptual improvement. This suggests that rather than using any form of hard voiced/unvoiced classification, e.g., the MSD-prior, it is better for synthesis to use a continuous F0 signal and rely on the frame-level soft voiced/unvoiced decision of the aperiodicity model.
Keywords :
hidden Markov models; speech processing; HMM-based TTS; continuous F0; source-excitation; source-excitation generation; speech signal processing; state-specific MSD-prior; voice classification; Equations; Generators; Hidden Markov models; Indexes; Mathematical model; HMM-based synthesis; aperiodicity; multi-band mixed excitation; voiced/unvoiced decision
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on
Conference_Location :
Prague
Print_ISBN :
978-1-4577-0538-0
Electronic_ISBN :
1520-6149
DOI :
10.1109/ICASSP.2011.5947410