Title :
Prediction of Fundamental Frequency and Voicing From Mel-Frequency Cepstral Coefficients for Unconstrained Speech Reconstruction
Author :
Milner, Ben ; Shao, Xu
Author_Institution :
Sch. of Comput. Sci., East Anglia Univ., Norwich
fDate :
2007
Abstract :
This work proposes a method for predicting the fundamental frequency and voicing of a frame of speech from its mel-frequency cepstral coefficient (MFCC) vector representation. This information is subsequently used to reconstruct a speech signal solely from a stream of MFCC vectors, and has particular application in distributed speech recognition systems. Prediction is achieved by modeling the joint density of fundamental frequency and MFCCs. The joint density is first modeled using a Gaussian mixture model (GMM) and then extended by using a set of hidden Markov models to link together a series of state-dependent GMMs. Prediction accuracy is measured on unconstrained speech input for both a speaker-dependent system and a speaker-independent system. A fundamental frequency prediction error of 3.06% is obtained on the speaker-dependent system, compared with 8.27% on the speaker-independent system. On the speaker-dependent system, 5.22% of frames have voicing errors, compared with 8.82% on the speaker-independent system. Spectrogram analysis of reconstructed speech shows that highly intelligible speech is produced, with the quality of the speaker-dependent speech being slightly higher owing to the more accurate fundamental frequency and voicing predictions.
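As an illustration of the GMM stage described in the abstract, the following is a minimal sketch of predicting fundamental frequency from a single MFCC vector, given a GMM over the joint vector [MFCC, f0]. It assumes the mixture weights, means, and full covariances have already been trained, and that f0 occupies the last dimension of the joint vector. The function names (log_gauss, predict_f0) and the posterior-weighted conditional-mean estimator are assumptions made for this sketch; the abstract does not specify the exact estimator, and the HMM extension and voicing decision are not covered here.

```python
import numpy as np


def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)


def predict_f0(x, weights, means, covs):
    """Predict f0 from one MFCC vector x using a joint GMM over [x, f0].

    Assumed layout (hypothetical, for this sketch only): means[k] has
    length D+1 and covs[k] has shape (D+1, D+1), with f0 as the last
    dimension.
    """
    D = len(x)
    K = len(weights)
    log_post = np.empty(K)
    cond_means = np.empty(K)
    for k in range(K):
        mu, S = means[k], covs[k]
        mu_x, mu_f = mu[:D], mu[D]
        S_xx, S_fx = S[:D, :D], S[D, :D]
        # Unnormalized log posterior of component k given the MFCC vector.
        log_post[k] = np.log(weights[k]) + log_gauss(x, mu_x, S_xx)
        # Conditional mean of f0 given x under component k.
        cond_means[k] = mu_f + S_fx @ np.linalg.solve(S_xx, x - mu_x)
    # Normalize the posteriors in a numerically stable way.
    post = np.exp(log_post - np.max(log_post))
    post /= post.sum()
    # Posterior-weighted combination of the per-component predictions.
    return float(post @ cond_means)
```

With a single component, this reduces to linear regression of f0 on the MFCC vector; with several components, each component contributes its own regression, weighted by how well it explains the observed MFCCs. Picking only the most probable component instead of the weighted sum would be another plausible variant.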
Keywords :
Gaussian processes; hidden Markov models; speech processing; speech recognition; Gaussian mixture model (GMM); distributed speech recognition; frequency prediction; mel-frequency cepstral coefficient; speaker-independent system; spectrogram analysis; speech frame; speech signal; unconstrained speech reconstruction; vector representation; voicing prediction; cepstral analysis; frequency estimation; predictive models; speech analysis; telecommunication standards; time domain analysis; correlation; fundamental frequency; hidden Markov model (HMM); maximum a posteriori (MAP)
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
DOI :
10.1109/TASL.2006.876880