  • DocumentCode
    177498
  • Title
    Multiple-average-voice-based speech synthesis
  • Author
    Lanchantin, Pierre; Gales, Mark J.F.; King, Simon; Yamagishi, Junichi
  • Author_Institution
    Eng. Dept., Cambridge Univ., Cambridge, UK
  • fYear
    2014
  • fDate
    4-9 May 2014
  • Firstpage
    285
  • Lastpage
    289
  • Abstract
    This paper describes a novel approach to speaker adaptation for statistical parametric speech synthesis systems, based on the interpolation of a set of average voice models (AVMs). Recent results have shown that the quality/naturalness of adapted voices depends on the distance from the average voice model used for speaker adaptation. This suggests using several AVMs, trained on carefully chosen speaker clusters, from which a more suitable AVM can be selected or interpolated during adaptation. In the proposed approach, a set of AVMs, a multiple-AVM, is trained on distinct clusters of speakers; speakers are initially assigned to clusters according to metadata and are iteratively re-assigned during estimation. During adaptation, each AVM in the multiple-AVM is first adapted towards the target speaker. The adapted means from the AVMs are then interpolated to yield the final speaker-adapted mean for synthesis. Speaker adaptation experiments on a corpus of British speakers with various regional accents show that the quality/naturalness of the synthetic speech of adapted voices is significantly higher than when a single factor-independent AVM, selected according to the target speaker characteristics, is used.
  • Keywords
    interpolation; speech synthesis; AVM; British speakers; adapted voices; average voice model; estimation process; final speaker; metadata; multiple average voice based speech synthesis; speaker adaptation; speaker clusters; statistical parametric speech synthesis systems; synthetic speech; target speaker; target speaker characteristics; Adaptation models; Hidden Markov models; Interpolation; Speech; Speech synthesis; Training; Vectors; HMM-Based speech synthesis; cluster adaptive training; multiple average voice model; speaker adaptation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on
  • Conference_Location
    Florence
  • Type
    conf
  • DOI
    10.1109/ICASSP.2014.6853603
  • Filename
    6853603
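The final interpolation step described in the abstract (combining the means adapted from each AVM into one speaker-adapted mean) can be illustrated with a minimal sketch. It assumes the interpolation is a plain weighted sum of the adapted mean vectors for a single Gaussian component, mu = sum_k w_k * mu_k. The record does not say how the interpolation weights are obtained (its keywords mention cluster adaptive training, where such weights are typically estimated from adaptation data), so here the weights, the function name `interpolate_adapted_means`, and the example values are all hypothetical and supplied by hand.

```python
from typing import Sequence


def interpolate_adapted_means(
    adapted_means: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> list[float]:
    """Hypothetical helper: combine the K adapted mean vectors of one Gaussian
    component (one per AVM) into a single mean, mu = sum_k w_k * mu_k.

    Assumption: the weights are given; they are NOT estimated here.
    """
    if len(adapted_means) != len(weights):
        raise ValueError("need exactly one weight per adapted AVM mean")
    dim = len(adapted_means[0])
    if any(len(m) != dim for m in adapted_means):
        raise ValueError("all adapted means must have the same dimension")
    # Weighted sum, computed one feature dimension at a time.
    return [sum(w * m[d] for w, m in zip(weights, adapted_means)) for d in range(dim)]


# Hypothetical example: two AVMs (e.g. two accent clusters), 3-dimensional means.
mu_adapted = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
w = [0.75, 0.25]
print(interpolate_adapted_means(mu_adapted, w))  # [1.5, 2.0, 2.5]
```

In a full system this combination would be applied to every Gaussian component of the adapted models; the sketch only shows the arithmetic for one component.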