Title :
Voice Conversion Using RNN Pre-Trained by Recurrent Temporal Restricted Boltzmann Machines
Author :
Nakashika, Toru ; Takiguchi, Tetsuya ; Ariki, Yasuo
Author_Institution :
Grad. Sch. of Syst. Inf., Kobe Univ., Kobe, Japan
Abstract :
This paper presents a voice conversion (VC) method that utilizes the recently proposed probabilistic models called recurrent temporal restricted Boltzmann machines (RTRBMs). One RTRBM is used for each speaker, with the goal of capturing high-order temporal dependencies in an acoustic sequence. Our algorithm starts from the separate training of one RTRBM for a source speaker and another for a target speaker using speaker-dependent training data. Because each RTRBM attempts to discover abstractions to maximally express the training data at each time step, as well as the temporal dependencies in the training data, we expect that the models represent the linguistic-related latent features in high-order spaces. In our approach, we convert (match) features of emphasis for the source speaker to those of the target speaker using a neural network (NN), so that the entire network (consisting of the two RTRBMs and the NN) acts as a deep recurrent NN and can be fine-tuned. Using VC experiments, we confirm the high performance of our method, especially in terms of objective criteria, relative to conventional VC methods such as approaches based on Gaussian mixture models and on NNs.
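The architecture described in the abstract (a source-speaker RTRBM, a connecting NN, and a target-speaker RTRBM chained into one recurrent network) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation. It assumes the usual RTRBM mean-field recurrence h_t = sigmoid(W v_t + U h_{t-1} + b), a single sigmoid layer for the connecting NN, a linear read-out on the target side, a zero initial hidden state, and random untrained weights. All function and parameter names (`encode`, `map_hidden`, `decode`, `convert`, `init_params`, `W_src`, `U_src`, `M`, `W_tgt`, and so on) are hypothetical, and pre-training and fine-tuning are omitted.

```python
import math
import random


def sigmoid(x):
    # tanh form avoids overflow for large negative inputs
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def encode(frames, W, U, b):
    """Source side: h_t = sigmoid(W v_t + U h_{t-1} + b)."""
    h_prev = [0.0] * len(b)  # assumption: zero initial hidden state
    hidden_seq = []
    for v in frames:
        pre = [a + r + c for a, r, c in zip(matvec(W, v), matvec(U, h_prev), b)]
        h_prev = [sigmoid(p) for p in pre]
        hidden_seq.append(h_prev)
    return hidden_seq


def map_hidden(hidden_seq, M, d):
    """Connecting NN (assumed one sigmoid layer): source latent -> target latent."""
    return [[sigmoid(p + c) for p, c in zip(matvec(M, h), d)] for h in hidden_seq]


def decode(hidden_seq, W_tgt, c):
    """Target side (assumed linear read-out): v_t = W_tgt^T h_t + c."""
    W_t = [list(col) for col in zip(*W_tgt)]  # transpose to visible x hidden
    return [[p + bias for p, bias in zip(matvec(W_t, h), c)] for h in hidden_seq]


def convert(frames, params):
    """Chain the three stages into one recurrent network."""
    h_src = encode(frames, params["W_src"], params["U_src"], params["b_src"])
    h_tgt = map_hidden(h_src, params["M"], params["d"])
    return decode(h_tgt, params["W_tgt"], params["c_tgt"])


def init_params(n_visible, n_hidden, seed=0):
    """Random, untrained weights, for shape illustration only."""
    rng = random.Random(seed)
    return {
        "W_src": random_matrix(n_hidden, n_visible, rng),
        "U_src": random_matrix(n_hidden, n_hidden, rng),
        "b_src": [0.0] * n_hidden,
        "M": random_matrix(n_hidden, n_hidden, rng),
        "d": [0.0] * n_hidden,
        "W_tgt": random_matrix(n_hidden, n_visible, rng),
        "c_tgt": [0.0] * n_visible,
    }


if __name__ == "__main__":
    params = init_params(n_visible=4, n_hidden=6)
    source_frames = [[0.1 * t] * 4 for t in range(5)]  # toy acoustic features
    converted = convert(source_frames, params)
    print(len(converted), len(converted[0]))
```

Because the hidden state at each frame depends on the previous one, the same input frame yields different latent features at different positions in the sequence. In this sketch, that recurrence is what lets the chained network model temporal dependencies rather than converting each frame in isolation.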
Keywords :
Boltzmann machines; Gaussian processes; acoustic signal processing; mixture models; probability; recurrent neural nets; speaker recognition; Gaussian mixture model; RNN; RTRBM; VC method; acoustic sequence; deep recurrent NN; high-order spaces; high-order temporal dependency; linguistic-related latent feature; neural network; probabilistic model; recurrent temporal restricted Boltzmann machines; source speaker; speaker-dependent training data; target speaker; voice conversion; Acoustics; Artificial neural networks; Data models; Speech; Training; Training data; Vectors; Deep Learning; recurrent neural network; recurrent temporal restricted Boltzmann machine (RTRBM); speaker specific features
Journal_Title :
Audio, Speech, and Language Processing, IEEE/ACM Transactions on
DOI :
10.1109/TASLP.2014.2379589