Voice conversion using deep neural network in super-frame feature space

Author

Wei Ye;Yibiao Yu

Author_Institution

School of Electronic and Information Engineering, Soochow University, Suzhou, China

fYear

2015

Firstpage

465

Lastpage

468

Abstract

This paper presents a voice conversion technique that uses deep neural networks (DNNs) to map the spectral envelopes of a source speaker to those of a target speaker. Short-time spectral envelopes are represented by linear prediction cepstrum coefficients (LPCC), and neighboring frames are gathered to form super-frames. The mapping ability of a DNN with a five-layer architecture consisting of three restricted Boltzmann machines (RBMs) is then exploited to derive the spectral conversion function. A comparative study of voice conversion using the DNN model and the conventional Gaussian mixture model (GMM) is conducted. Experimental results show that the speaker identification rate of the converted speech reaches 97.5%, which is 0.8% higher than that of the GMM method, and that the average cepstrum distortion is 0.87, a 5.4% performance gain over the GMM method. ABX and MOS evaluations indicate that the conversion performance is better than that of the traditional GMM method under the parallel corpora condition.
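
The super-frame idea mentioned in the abstract (gathering neighboring frames into one longer feature vector) can be illustrated with a minimal sketch. The abstract does not state the context width or how utterance edges are handled, so the function name `make_superframes`, the `context` parameter, and the edge handling (repeating the first and last frame) are assumptions made here for illustration only, not the authors' implementation.

```python
from typing import List, Sequence


def make_superframes(frames: Sequence[Sequence[float]],
                     context: int = 1) -> List[List[float]]:
    """Stack each frame with `context` neighbors on each side.

    frames:  T feature vectors (e.g. per-frame LPCC vectors), each of length D.
    context: number of neighbor frames taken on each side (assumed value).
    Returns T vectors of length (2 * context + 1) * D.
    """
    if context < 0:
        raise ValueError("context must be non-negative")
    n = len(frames)
    superframes: List[List[float]] = []
    for t in range(n):
        stacked: List[float] = []
        for offset in range(-context, context + 1):
            # Assumption: clamp indices at the utterance edges, which
            # repeats the first or last frame instead of zero-padding.
            idx = min(max(t + offset, 0), n - 1)
            stacked.extend(frames[idx])
        superframes.append(stacked)
    return superframes
```

With `context=1`, three 2-dimensional frames become three 6-dimensional super-frames. In a setup like the one described, source and target super-frames built this way would serve as the input and output of the mapping network.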

Keywords

"Yttrium","Decision support systems","Training"

Publisher

IEEE

Conference_Titel

Intelligent Control and Information Processing (ICICIP), 2015 Sixth International Conference on

Print_ISBN

978-1-4799-1715-0

Type

conf

DOI

10.1109/ICICIP.2015.7388216

Filename

7388216