DocumentCode :
1125755
Title :
A Cross-Language State Sharing and Mapping Approach to Bilingual (Mandarin–English) TTS
Author :
Qian, Yao ; Liang, Hui ; Soong, Frank K.
Author_Institution :
Microsoft Research Asia, Beijing, China
Volume :
17
Issue :
6
fYear :
2009
Firstpage :
1231
Lastpage :
1239
Abstract :
We propose a hidden Markov model (HMM)-based bilingual (Mandarin and English) text-to-speech (TTS) system to synthesize natural speech for given bilingual text. A simple baseline system consisting of two independent monolingual HMM synthesizers is built first from corresponding Mandarin and English data recorded by a bilingual speaker. A new, mixed-language TTS system is then constructed by asking language-independent and language-specific questions for sharing HMM states across the two languages in decision-tree-based clustering. By sharing states, the new system has a smaller footprint than the baseline system. Speech synthesized by the new system sounds very similar to the baseline for monolingual (Mandarin-only or English-only) sentences but much better for mixed-language sentences. This higher quality of mixed-language output is confirmed by a preference score of 60.2% to 39.8% in a subjective listening test. A cross-language state mapping algorithm is further proposed for cross-language synthesis when only monolingual (English) recorded data from a source-language speaker is available. Mandarin speech is then synthesized with the HMM model parameters in the nearest-neighbor leaf nodes of the English decision tree. The nearest neighbor is determined by the Kullback–Leibler divergence (KLD), and mappings between leaf nodes in the decision trees of the source and target languages are established via speech data recorded by a different, bilingual speaker. High voice (speaker) similarity is preserved in the synthesized target-language sentences by using the source-language recordings of the monolingual speaker. Perceptual tests on the synthesized Mandarin speech show 1) high intelligibility, confirmed by a Chinese character transcription accuracy of 92.1%, and 2) decent speech quality, with an average mean opinion score (MOS) of 3.1.
Keywords :
decision trees; hidden Markov models; natural language processing; speech synthesis; English decision tree; Kullback-Leibler divergence; Mandarin speech; bilingual speaker; cross-language state sharing; decision-tree based clustering; hidden Markov model; independent monolingual HMM synthesizer; mapping approach; mixed-language sentences; speech quality; text-to-speech system; Asia; Decision trees; Engines; Hidden Markov models; Loudspeakers; Natural languages; Nearest neighbor searches; Speech processing; Speech synthesis; Testing; Bilingual; Kullback–Leibler divergence (KLD); hidden Markov model (HMM)-based speech synthesis; new language synthesis;
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TASL.2009.2015708
Filename :
5153557