A unified trajectory tiling approach to high quality TTS and cross-lingual voice transformation

Author

Qian, Yao; Soong, Frank K.

Author_Institution

Microsoft Res. Asia, Beijing, China

fYear

2012

fDate

5-8 Dec. 2012

Firstpage

165

Lastpage

169

Abstract

In human-machine speech communication, it is technically challenging to make the machine talk as naturally as a human so as to facilitate “frictionless” interactions, that is, to make a human user feel that the communication is as natural as human-to-human conversation. We propose a trajectory tiling approach to high-quality speech synthesis, in which speech parameter trajectories, extracted from natural, processed, or synthesized speech, guide the search for the best sequence of waveform segment “tiles” stored in a pre-recorded speech database. We test our approach in both TTS and cross-lingual voice transformation applications. Experimental results show that the proposed trajectory tiling approach can render speech that is both natural and highly intelligible. The high perceived quality is confirmed in both objective and subjective tests.
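The search described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the cost functions, the tile granularity, or the search algorithm, so everything below is an assumption made for illustration: tiles are treated as single frames, the target cost and the concatenation cost are both plain Euclidean distances, and the best sequence is found with a Viterbi-style dynamic program. The names `tile_search`, `guide`, `candidates`, and `concat_weight` are hypothetical and do not come from the paper.

```python
import numpy as np


def tile_search(guide, candidates, concat_weight=1.0):
    """Pick one candidate tile per frame by dynamic programming.

    guide:      (T, D) array, the guiding parameter trajectory.
    candidates: list of T arrays, each (K_t, D), features of candidate tiles.
    Returns a list of T indices, one chosen candidate per frame.

    Assumed costs (not taken from the paper):
      target cost        = Euclidean distance from tile to guiding frame
      concatenation cost = Euclidean distance between adjacent tiles
    """
    # Target cost of every candidate at every frame.
    target = [np.linalg.norm(c - g, axis=1) for c, g in zip(candidates, guide)]

    best = [target[0]]  # best[t][k]: cheapest path cost ending in tile k
    back = []           # back[t-1][k]: best predecessor of tile k at frame t
    for t in range(1, len(guide)):
        prev, cur = candidates[t - 1], candidates[t]
        # Pairwise concatenation cost, shape (K_prev, K_cur).
        join = np.linalg.norm(prev[:, None, :] - cur[None, :, :], axis=2)
        total = best[-1][:, None] + concat_weight * join
        back.append(np.argmin(total, axis=0))
        best.append(np.min(total, axis=0) + target[t])

    # Trace the cheapest path backwards from the last frame.
    path = [int(np.argmin(best[-1]))]
    for pointers in reversed(back):
        path.append(int(pointers[path[-1]]))
    return path[::-1]


if __name__ == "__main__":
    guide = np.array([[0.0], [1.0], [2.0]])
    tiles = [np.array([[0.0], [1.0], [2.0]])] * 3
    print(tile_search(guide, tiles, concat_weight=0.0))
    print(tile_search(guide, tiles, concat_weight=100.0))
```

With `concat_weight=0.0` the search simply follows the guiding trajectory frame by frame; a large weight favors smooth joins over closeness to the guide. A real system would presumably use longer waveform segments and perceptually motivated costs.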

Keywords

human computer interaction; speech intelligibility; speech synthesis; waveform analysis; best waveform segment tile sequence search; cross-lingual voice transformation applications; high quality TTS; high quality speech synthesis; human-machine speech communication; natural communication; speech database; speech parameter trajectories; speech rendering; unified trajectory tiling approach; Hidden Markov models; Rendering (computer graphics); Speech; Tiles; Training data; Trajectory; cross-lingual; trajectory tiling; voice transformation

fLanguage

English

Publisher

IEEE

Conference_Titel

Chinese Spoken Language Processing (ISCSLP), 2012 8th International Symposium on

Conference_Location

Kowloon

Print_ISBN

978-1-4673-2506-6

Electronic_ISBN

978-1-4673-2505-9

Type

conf

DOI

10.1109/ISCSLP.2012.6423506

Filename

6423506