Title :
Unsupervised prosodic phrase boundary labeling of Mandarin speech synthesis database using context-dependent HMM
Author :
Chen-Yu Yang ; Zhen-Hua Ling ; Li-Rong Dai
Author_Institution :
National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China
Abstract :
In this paper, an automatic and unsupervised method based on context-dependent hidden Markov models (CD-HMMs) is proposed for labeling the phrase boundary positions of a Mandarin speech synthesis database. The initial phrase boundary labels are predicted by clustering, in an unsupervised way, the durations of the pauses between every two adjacent prosodic words. Then, using these initial labels, the CD-HMMs for the spectrum, F0 and phone duration are estimated in a manner similar to HMM-based parametric speech synthesis. The labels are further updated by Viterbi decoding under the maximum likelihood criterion, given the acoustic feature sequences and the trained CD-HMMs. The model training and Viterbi decoding procedures are conducted iteratively until convergence. Experimental results on a Mandarin speech synthesis database show that this method labels the phrase boundary positions much more accurately than the text-analysis-based method, without requiring any manually labeled training data. The unit selection speech synthesis system constructed using the phrase boundary labels generated by the proposed method achieves performance similar to that of a system built using the manual labels.
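The initialization step described in the abstract (clustering pause durations to obtain first-pass boundary labels) can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or duration features are used, so the two-cluster one-dimensional k-means below is an assumption, and the function name `cluster_pause_durations` is hypothetical. The later steps (CD-HMM training and Viterbi re-labeling) are not shown.

```python
def cluster_pause_durations(durations, n_iter=50):
    """Two-cluster 1-D k-means over pause durations (in seconds).

    Assumption: k-means with k=2 stands in for the unspecified clustering
    method. Returns (labels, threshold), where labels[i] is True when pause i
    falls in the longer-duration cluster, treated here as a candidate phrase
    boundary. threshold is None when the data cannot be split.
    """
    if not durations:
        return [], None
    lo, hi = min(durations), max(durations)
    if lo == hi:
        # All pauses have the same length: no basis for two clusters.
        return [False] * len(durations), None
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        short = [d for d in durations if d <= mid]
        long_ = [d for d in durations if d > mid]
        # Both groups are non-empty: the minimum is always <= mid
        # and the maximum is always > mid while lo < hi.
        new_lo = sum(short) / len(short)
        new_hi = sum(long_) / len(long_)
        if new_lo == lo and new_hi == hi:
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    threshold = (lo + hi) / 2.0
    return [d > threshold for d in durations], threshold


# Hypothetical pause durations between consecutive prosodic words.
pauses = [0.0, 0.02, 0.03, 0.01, 0.25, 0.30, 0.02, 0.28]
labels, threshold = cluster_pause_durations(pauses)
print(labels, threshold)
```

In the method summarized above, labels of this kind serve only as a starting point; they are then refined by alternating model training and Viterbi decoding until the labels stop changing.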
Keywords :
Viterbi decoding; acoustic signal processing; frequency estimation; hidden Markov models; maximum likelihood decoding; natural language processing; pattern clustering; spectral analysis; speech synthesis; CD-HMM; HMM-based parametric speech synthesis; Mandarin speech synthesis database; acoustic feature sequences; automatic method; context-dependent HMM; context-dependent hidden Markov model; fundamental frequency estimation; initial phrase boundary label prediction; maximum likelihood criterion; model training procedures; pause duration clustering; phone duration estimation; phrase boundary position labeling; prosodic words; spectrum estimation; unit selection speech synthesis system; unsupervised prosodic phrase boundary labeling; Acoustics; Databases; Labeling; Speech; Training; phrase boundary; unsupervised labeling
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on
Conference_Location :
Vancouver, BC
DOI :
10.1109/ICASSP.2013.6638994