DocumentCode :
739172
Title :
A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis
Author :
Ling-Hui Chen ; Raitio, Tuomo ; Valentini-Botinhao, Cassia ; Zhen-Hua Ling ; Yamagishi, Junichi
Author_Institution :
Nat. Eng. Lab. for Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China
Volume :
23
Issue :
11
fYear :
2015
Firstpage :
2003
Lastpage :
2014
Abstract :
The generated speech of hidden Markov model (HMM)-based statistical parametric speech synthesis still sounds “muffled.” One cause of this degradation in speech quality may be the loss of fine spectral structures. In this paper, we propose to use a deep generative architecture, a generatively trained deep neural network (DNN), as a postfilter. The network models the conditional probability of the spectrum of natural speech given that of synthetic speech to compensate for the gap between synthetic and natural speech. The proposed probabilistic postfilter is generatively trained by cascading two restricted Boltzmann machines (RBMs) or deep belief networks (DBNs) with one bidirectional associative memory (BAM). We devised two types of DNN postfilters: one operating in the mel-cepstral domain and the other in the higher dimensional spectral domain. We compare these two new data-driven postfilters with other types of postfilters that are currently used in speech synthesis: a fixed mel-cepstral based postfilter, the global variance based parameter generation, and the modulation spectrum-based enhancement. Subjective evaluations using the synthetic voices of a male and female speaker confirmed that the proposed DNN-based postfilter in the spectral domain significantly improved the segmental quality of synthetic speech compared with conventional methods.
Keywords :
Boltzmann machines; belief networks; cepstral analysis; hidden Markov models; neural nets; probability; speech enhancement; speech synthesis; statistical analysis; BAM; DBN; DNN; HMM; RBM; bidirectional associative memory; conditional probability; deep belief network; deep generative architecture; deep neural network; female speaker; global variance; hidden Markov model; male speaker; mel-cepstral domain; modulation spectrum-based enhancement; parameter generation; probabilistic postfilter; restricted Boltzmann machine; spectral domain; spectral structure; speech quality; statistical parametric speech synthesis; synthetic voice; Acoustics; Hidden Markov models; Modulation; Natural languages; Speech; Speech synthesis; Trajectory; Deep generative architecture; hidden Markov model (HMM); modulation spectrum; postfilter; segmental quality; speech synthesis;
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE/ACM Transactions on
Publisher :
IEEE
ISSN :
2329-9290
Type :
jour
DOI :
10.1109/TASLP.2015.2461448
Filename :
7169536