A global, boundary-centric framework for unit selection text-to-speech synthesis

Author

Bellegarda, Jerome R.

Author_Institution

Speech & Language Technol. Group, Apple Comput. Inc., Cupertino, CA, USA

Volume

14

Issue

3

fYear

2006

fDate

5/1/2006 12:00:00 AM

Firstpage

990

Lastpage

997

Abstract

The level of quality that can be achieved by modern concatenative text-to-speech synthesis heavily depends on the optimization criteria used in the unit selection process. While effective cost functions arise naturally for prosody assessment, the criteria typically selected to quantify discontinuities in the speech signal do not closely reflect users' perception of the resulting acoustic waveform. This paper introduces an alternative feature extraction paradigm, which eschews general purpose Fourier analysis in favor of a modal decomposition separately optimized for each boundary region. The ensuing transform framework preserves, by construction, those properties of the waveform which are globally relevant to each concatenation considered. In addition, it leads to a novel discontinuity measure which jointly, albeit implicitly, accounts for both interframe incoherence and discrepancies in formant frequencies/bandwidths. Experimental evaluations are conducted to characterize the behavior of this new metric, first on a contiguity prediction task, and then via a systematic listening comparison using a conventional metric as baseline. The results underscore the viability of the proposed framework in quantifying the perception of discontinuity between acoustic units.

Keywords

Fourier analysis; feature extraction; speech processing; speech synthesis; discontinuity perception; feature extraction paradigm; modal decomposition; prosody assessment; segment concatenation; text-to-speech synthesis; unit selection process; Acoustic measurements; Acoustic waves; Assembly; Bandwidth; Cost function; Frequency measurement; Modal analysis; Signal synthesis; distance measure; join cost; unit selection

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE Transactions on

Publisher

IEEE

ISSN

1558-7916

Type

jour

DOI

10.1109/TSA.2005.858048

Filename

1621211