Title :
A global, boundary-centric framework for unit selection text-to-speech synthesis
Author :
Bellegarda, Jerome R.
Author_Institution :
Speech & Language Technol. Group, Apple Comput. Inc., Cupertino, CA, USA
fDate :
5/1/2006
Abstract :
The level of quality that can be achieved by modern concatenative text-to-speech synthesis heavily depends on the optimization criteria used in the unit selection process. While effective cost functions arise naturally for prosody assessment, the criteria typically selected to quantify discontinuities in the speech signal do not closely reflect users' perception of the resulting acoustic waveform. This paper introduces an alternative feature extraction paradigm, which eschews general purpose Fourier analysis in favor of a modal decomposition separately optimized for each boundary region. The ensuing transform framework preserves, by construction, those properties of the waveform which are globally relevant to each concatenation considered. In addition, it leads to a novel discontinuity measure which jointly, albeit implicitly, accounts for both interframe incoherence and discrepancies in formant frequencies/bandwidths. Experimental evaluations are conducted to characterize the behavior of this new metric, first on a contiguity prediction task, and then via a systematic listening comparison using a conventional metric as baseline. The results underscore the viability of the proposed framework in quantifying the perception of discontinuity between acoustic units.
Keywords :
Fourier analysis; feature extraction; speech processing; speech synthesis; discontinuity perception; feature extraction paradigm; modal decomposition; prosody assessment; segment concatenation; text-to-speech synthesis; unit selection process; Acoustic measurements; Acoustic waves; Assembly; Bandwidth; Cost function; Frequency measurement; Modal analysis; Signal synthesis; distance measure; join cost; unit selection
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
DOI :
10.1109/TSA.2005.858048