DocumentCode
78984
Title
Parameter Generation Methods With Rich Context Models for High-Quality and Flexible Text-To-Speech Synthesis
Author
Takamichi, Shinnosuke ; Toda, Tomoki ; Shiga, Yoshinori ; Sakti, Sakriani ; Neubig, Graham ; Nakamura, Satoshi
Author_Institution
Grad. Sch. of Inf. Sci., Nara Inst. of Sci. & Technol., Ikoma, Japan
Volume
8
Issue
2
fYear
2014
fDate
April 2014
Firstpage
239
Lastpage
250
Abstract
In this paper, we propose parameter generation methods using rich context models as yet another hybrid method combining hidden Markov model (HMM)-based speech synthesis and unit selection synthesis. Traditional HMM-based speech synthesis enables flexible modeling of acoustic features based on a statistical approach. However, the generated speech parameters tend to be excessively smoothed. To address this problem, several hybrid methods combining HMM-based speech synthesis and unit selection synthesis have been proposed. Although they significantly improve the quality of synthetic speech, they usually lose the flexibility of the original HMM-based speech synthesis. In the proposed methods, we use rich context models, which are statistical models that represent individual acoustic parameter segments. In training, the rich context models are reformulated as Gaussian mixture models (GMMs). In synthesis, initial speech parameters are generated from probability distributions over-fitted to individual segments, and the speech parameter sequence is then iteratively generated from the GMMs using a parameter generation method based on the maximum likelihood criterion. Since the basic framework of the proposed methods is the same as the traditional framework, the capability of flexibly modeling acoustic features is retained. The experimental results demonstrate that: (1) approximating with a single Gaussian component sequence yields better synthetic speech quality than using the EM algorithm in the proposed parameter generation method; (2) state-based model selection yields quality improvements at the same level as frame-based model selection; (3) initializing with parameters generated from the over-fitted speech probability distributions is very effective in further improving speech quality; and (4) the proposed methods for the spectral and F0 components yield significant improvements in synthetic speech quality compared with traditional HMM-based speech synthesis.
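Note: the maximum-likelihood parameter generation step referred to in the abstract is, at its core, a quadratic optimization over the static feature trajectory under static-plus-delta window constraints. The sketch below is not the paper's rich-context/GMM procedure; it is a minimal illustration of the single-Gaussian-per-frame special case (the approximation mentioned in result (1)) for one 1-D feature stream. The function name mlpg, the window coefficients, and the toy inputs are choices made here for illustration only.

```python
import numpy as np

def mlpg(means, variances, delta_win=(-0.5, 0.0, 0.5)):
    """Maximum-likelihood parameter generation for one 1-D feature stream
    with static + delta components, one Gaussian per frame.

    means, variances: arrays of shape (T, 2) with the per-frame Gaussian
    mean/variance of [static, delta] features.  Returns the static
    trajectory c (length T) that maximizes the likelihood under the
    constraint that delta features are linear functions of c.
    """
    T = means.shape[0]
    # Build the 2T x T window matrix W: one static row (identity) and
    # one delta row (finite-difference window) per frame.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static window
        for k, w in enumerate(delta_win):      # delta window
            tau = t + k - 1
            if 0 <= tau < T:
                W[2 * t + 1, tau] = w
    mu = means.reshape(-1)                     # stacked [static, delta] means
    prec = 1.0 / variances.reshape(-1)         # diagonal precisions
    A = W.T @ (prec[:, None] * W)              # W^T Sigma^{-1} W
    b = W.T @ (prec * mu)                      # W^T Sigma^{-1} mu
    return np.linalg.solve(A, b)               # smoothed static trajectory

# Toy usage: 5 frames, unit variances, rising static means.
means = np.stack([np.arange(5.0), np.zeros(5)], axis=1)
variances = np.ones((5, 2))
print(mlpg(means, variances))
```

In the paper's setting this solve would be repeated inside an iterative loop over GMM (rich context model) component assignments rather than applied once to a fixed Gaussian sequence.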
Keywords
hidden Markov models; maximum likelihood estimation; speech synthesis; GMM; Gaussian component sequence; Gaussian mixture models; HMM; Hidden Markov Model; acoustic features; acoustic parameter segments; flexible text-to-speech synthesis; high quality text-to-speech synthesis; maximum likelihood criterion; parameter generation methods; probability distributions; rich context models; speech parameter sequence; speech probability distributions; statistical approach; unit selection synthesis; Acoustics; Context; Context modeling; Hidden Markov models; Speech; Speech synthesis; Vectors; GMM; HMM-based speech synthesis; over-smoothing; parameter generation; rich context model;
fLanguage
English
Journal_Title
Selected Topics in Signal Processing, IEEE Journal of
Publisher
IEEE
ISSN
1932-4553
Type
jour
DOI
10.1109/JSTSP.2013.2288599
Filename
6654272
Link To Document