Use of fundamental frequencies shaped by generation process model for HMM-based speech synthesis

Author

Hirose, Keikichi ; Hashimoto, Hiroya ; Hyakutake, Kyota ; Saito, Daisuke ; Minematsu, Nobuaki

Author_Institution

Dept. of Inf. & Commun. Eng., Univ. of Tokyo, Tokyo, Japan

fYear

2014

fDate

19-23 Oct. 2014

Firstpage

555

Lastpage

560

Abstract

The generation process model of fundamental frequency (F₀) contours is known to represent the global movements of F₀ while keeping a clear relation to the linguistic information of utterances. Although HMM-based speech synthesis can generate speech of good quality, problems arising from its frame-by-frame processing have been pointed out. These problems are expected to be solved by incorporating the model constraints. A method is developed that uses F₀ contours approximated by the model, instead of observed F₀ contours, for HMM training. Listening experiments show a clear improvement in the quality of synthetic speech. This method, however, ignores fragments of F₀ contours not represented by the model (F₀ residuals). A further scheme is introduced to cope with this issue: F₀ residuals are also included in the training and synthesis processes of HMM-based speech synthesis, and the generated F₀ residuals are added to the model-based F₀ values before waveform generation. The model constraint has another merit: the relations between generated F₀ contours and texts are clear, so it is possible to add linguistic information such as emphasis to synthetic speech, or to change speaking styles, by manipulating F₀ within the F₀ model framework. Several experimental results supporting the advantages of the method are shown.

Keywords

hidden Markov models; speech synthesis; HMM training; HMM-based speech synthesis; frame-by-frame processing; fundamental frequency contour generation process model; hidden Markov model; listening experiment; speaking style change; synthetic speech quality; utterance linguistic information; waveform generation; Feature extraction; Pragmatics; Speech; Training; F0 residual; Flexible F0 control; Generation process model

fLanguage

English

Publisher

IEEE

Conference_Titel

Signal Processing (ICSP), 2014 12th International Conference on

Conference_Location

Hangzhou

ISSN

2164-5221

Print_ISBN

978-1-4799-2188-1

Type

conf

DOI

10.1109/ICOSP.2014.7015066

Filename

7015066