DocumentCode :
900411
Title :
A global, boundary-centric framework for unit selection text-to-speech synthesis
Author :
Bellegarda, Jerome R.
Author_Institution :
Speech & Language Technol. Group, Apple Comput. Inc., Cupertino, CA, USA
Volume :
14
Issue :
3
fYear :
2006
fDate :
5/1/2006
Firstpage :
990
Lastpage :
997
Abstract :
The level of quality that can be achieved by modern concatenative text-to-speech synthesis heavily depends on the optimization criteria used in the unit selection process. While effective cost functions arise naturally for prosody assessment, the criteria typically selected to quantify discontinuities in the speech signal do not closely reflect users' perception of the resulting acoustic waveform. This paper introduces an alternative feature extraction paradigm, which eschews general purpose Fourier analysis in favor of a modal decomposition separately optimized for each boundary region. The ensuing transform framework preserves, by construction, those properties of the waveform which are globally relevant to each concatenation considered. In addition, it leads to a novel discontinuity measure which jointly, albeit implicitly, accounts for both interframe incoherence and discrepancies in formant frequencies/bandwidths. Experimental evaluations are conducted to characterize the behavior of this new metric, first on a contiguity prediction task, and then via a systematic listening comparison using a conventional metric as baseline. The results underscore the viability of the proposed framework in quantifying the perception of discontinuity between acoustic units.
Keywords :
Fourier analysis; feature extraction; speech processing; speech synthesis; discontinuity perception; feature extraction paradigm; modal decomposition; prosody assessment; segment concatenation; text-to-speech synthesis; unit selection process; acoustic measurements; acoustic waves; assembly; bandwidth; cost function; frequency measurement; modal analysis; signal synthesis; distance measure; join cost; unit selection
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TSA.2005.858048
Filename :
1621211