• DocumentCode
    1523371
  • Title

    A Dynamic Cost Weighting Framework for Unit Selection Text-to-Speech Synthesis

  • Author

    Bellegarda, Jerome R.

  • Author_Institution
    Speech & Language Technol., Apple, Inc., Cupertino, CA, USA
  • Volume
    18
  • Issue
    6
  • fYear
    2010
  • Firstpage
    1455
  • Lastpage
    1463
  • Abstract
    Unit selection text-to-speech synthesis relies on multiple cost criteria, each encapsulating a different aspect of acoustic and prosodic context at any given concatenation point. Constraints are normally invoked on diverse characteristics such as inter-unit discontinuity, overall pitch contour, local duration profile, etc., leading to costs often too heterogeneous for a direct quantitative comparison. In order to rank available candidate units, this complexity must be reduced to a single number, and the relative importance of each information stream becomes highly critical. Yet this influence is typically determined in an empirical manner (e.g., based on a limited amount of synthesized data), yielding global weights that are thus applied to broad classes of concatenations indiscriminately. This paper proposes an alternative approach, dynamic cost weighting, based on a data-driven framework separately optimized for each concatenation considered. Specifically, the cost distribution in every stream is dynamically leveraged on a per-concatenation basis to locally shift weight towards those characteristics that offer a high discrimination between candidate units, and away from those characteristics that are intrinsically less discriminative. An illustrative case study demonstrates the potential benefits of this solution, and listening evidence suggests that it does indeed entail higher perceived TTS quality.
  • Keywords
    acoustic signal processing; natural language processing; speech processing; speech synthesis; acoustic context; candidate ranking; concatenation-specific cost weighting; cost distribution; multiple cost criteria; multiple information streams; prosodic context; unit selection text-to-speech synthesis; concatenative speech synthesis; unit selection
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type

    jour

  • DOI
    10.1109/TASL.2009.2035209
  • Filename
    5299072