Regional Information Center for Science and Technology - A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis

DocumentCode :

739172

Title :

A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis

Author :

Chen, Ling-Hui ; Raitio, Tuomo ; Valentini-Botinhao, Cassia ; Ling, Zhen-Hua ; Yamagishi, Junichi

Author_Institution :

National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

Volume :

Issue :

Year :

2015

Firstpage :

2003

Lastpage :

2014

Abstract :

Speech generated by hidden Markov model (HMM)-based statistical parametric speech synthesis still sounds “muffled.” One cause of this degradation in speech quality may be the loss of fine spectral structures. In this paper, we propose using a deep generative architecture, a generatively trained deep neural network (DNN), as a postfilter. The network models the conditional probability of the spectrum of natural speech given that of synthetic speech, to compensate for the gap between synthetic and natural speech. The proposed probabilistic postfilter is generatively trained by cascading two restricted Boltzmann machines (RBMs) or deep belief networks (DBNs) with one bidirectional associative memory (BAM). We devised two types of DNN postfilters: one operating in the mel-cepstral domain and the other in the higher-dimensional spectral domain. We compare these two new data-driven postfilters with other types of postfilters currently used in speech synthesis: a fixed mel-cepstral-based postfilter, global variance-based parameter generation, and modulation spectrum-based enhancement. Subjective evaluations using the synthetic voices of a male and a female speaker confirmed that the proposed DNN-based postfilter in the spectral domain significantly improved the segmental quality of synthetic speech compared with the conventional methods.
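
The cascade described in the abstract (one RBM on the synthetic side, a BAM linking the two hidden layers, and one RBM on the natural side) can be pictured as a single feed-forward pass once training is done. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the weights are random and untrained, the pass is a deterministic mean-field approximation, the output layer is assumed to be linear (as for Gaussian visible units), and every name (`make_postfilter`, `postfilter`, `enc_w`, `bam_w`, `dec_w`) is hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic activation used for the (assumed) binary hidden units."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Return weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def make_postfilter(dim, hidden, seed=0):
    """Build untrained parameters for the three cascaded blocks (hypothetical layout)."""
    rng = random.Random(seed)
    return {
        # Synthetic-side RBM: visible (synthetic spectrum) -> hidden
        "enc_w": random_matrix(hidden, dim, rng),
        "enc_b": [0.0] * hidden,
        # BAM: synthetic-side hidden -> natural-side hidden
        "bam_w": random_matrix(hidden, hidden, rng),
        "bam_b": [0.0] * hidden,
        # Natural-side RBM used top-down: hidden -> visible (natural spectrum)
        "dec_w": random_matrix(dim, hidden, rng),
        "dec_b": [0.0] * dim,
    }


def postfilter(params, synthetic_frame):
    """Map one synthetic spectral frame to an estimate of the natural frame."""
    h_syn = [sigmoid(z) for z in
             affine(params["enc_w"], params["enc_b"], synthetic_frame)]
    h_nat = [sigmoid(z) for z in
             affine(params["bam_w"], params["bam_b"], h_syn)]
    # Linear output: the mean of the assumed Gaussian visible units.
    return affine(params["dec_w"], params["dec_b"], h_nat)


params = make_postfilter(dim=5, hidden=8)
frame = [0.1, -0.2, 0.3, 0.0, 0.05]  # made-up feature vector, not real data
enhanced = postfilter(params, frame)
```

In this sketch `enhanced` has the same dimensionality as `frame`; with trained parameters the same pass would be applied frame by frame to mel-cepstral or spectral features.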

Keywords :

Boltzmann machines; belief networks; cepstral analysis; hidden Markov models; neural nets; probability; speech enhancement; speech synthesis; statistical analysis; BAM; DBN; DNN; HMM; RBM; bidirectional associative memory; conditional probability; deep belief network; deep generative architecture; deep neural network; female speaker; global variance; hidden Markov model; male speaker; mel-cepstral domain; modulation spectrum-based enhancement; parameter generation; probabilistic postfilter; restricted Boltzmann machine; spectral domain; spectral structure; speech quality; statistical parametric speech synthesis; synthetic voice; Acoustics; Modulation; Natural languages; Speech; Trajectory; hidden Markov model (HMM); modulation spectrum; postfilter; segmental quality

Language :

English

Journal_Title :

Audio, Speech, and Language Processing, IEEE/ACM Transactions on

Publisher :

IEEE

ISSN :

2329-9290

Type :

jour

DOI :

10.1109/TASLP.2015.2461448

Filename :

7169536

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=739172