Improvement of naturalness for an HMM-based Vietnamese speech synthesis using the prosodic information

Author

Thanh-Son Phan ; Tu-Cuong Duong ; Anh-Tuan Dinh ; Tat-Thang Vu ; Chi-Mai Luong

Author_Institution

Fac. of Inf. Technol., Le Qui Don Tech. Univ., Hanoi, Vietnam

fYear

2013

fDate

10-13 Nov. 2013

Firstpage

276

Lastpage

281

Abstract

Natural-sounding synthesized speech is the goal of HMM-based text-to-speech (TTS) systems. Besides using context-dependent tri-phone units from a large-corpus speech database, many prosodic features have been used in full-context labels to improve the naturalness of HMM-based Vietnamese synthesizers. In the prosodic specification, tone, part-of-speech (POS), and intonation information are considered less important than positional information. Context-dependent information includes the phoneme sequence as well as prosodic information, because the naturalness of synthetic speech depends heavily on prosody such as pauses, tone, intonation pattern, and segmental duration. In this paper, we propose decision tree questions that use context-dependent tones and investigate the impact of POS and intonation tagging on the naturalness of an HMM-based voice. Experimental results from an objective evaluation and a MOS test show that the proposed method can improve the naturalness of an HMM-based Vietnamese TTS system.

Keywords

decision trees; hidden Markov models; natural language processing; speech synthesis; HMM-based Vietnamese TTS naturalness improvement; HMM-based Vietnamese speech synthesis; HMM-based text-to-speech systems; HMM-based voice; MOS test; POS; context dependent triphone units; context-dependent information; context-dependent tones; decision tree questions; full-context labels; intonation information; intonation pattern; intonation tagging; large corpus speech database; natural-sounding synthesized speech; objective evaluation; part-of-speech; pause; phoneme sequence; positional information; prosodic information; prosodic specification; prosody features; segmental duration; synthetic speech; Context; Databases; Speech; Training; Vectors; HMM; HTS; Vietnamese Speech Synthesis; context-dependent; decision tree-based clustering; tri-phone

fLanguage

English

Publisher

IEEE

Conference_Titel

Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2013 IEEE RIVF International Conference on

Conference_Location

Hanoi

Print_ISBN

978-1-4799-1349-7

Type

conf

DOI

10.1109/RIVF.2013.6719907

Filename

6719907