Recent improvements to the IBM trainable speech synthesis system

Author

Eide, E. ; Aaron, A. ; Bakis, R. ; Cohen, P. ; Donovan, R. ; Hamza, W. ; Mathes, T. ; Picheny, M. ; Polkosky, M. ; Smith, M. ; Viswanathan, M.

Author_Institution

IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA

Volume

1

fYear

2003

fDate

6-10 April 2003

Abstract

In this paper, we describe the current status of the trainable text-to-speech system at IBM. Recent algorithmic and database changes to the system have led to significant gains in output quality. On the algorithmic side, we have introduced statistical models for predicting pitch and duration targets, which replace the rule-based target generation previously employed. Additionally, we have changed the cost function and the search strategy, introduced a post-search pitch smoothing algorithm, and improved our method of preselection. Through the combined data and algorithmic contributions, we have been able to significantly improve (p < 0.0001) the mean opinion score (MOS) of our female voice, from 3.68 to 4.85 when heard over loudspeakers and to 5.42 when heard over the telephone (seven-point scale).

Keywords

frequency estimation; prediction theory; search problems; smoothing methods; speech synthesis; statistical analysis; IBM trainable speech synthesis system; algorithmic changes; cost function; database changes; duration; mean opinion score; output quality; pitch prediction; post-search pitch smoothing algorithm; preselection; search strategy; statistical models; text-to-speech system; Cost function; Databases; Decision trees; Knowledge based systems; Signal generators; Signal processing algorithms; Smoothing methods; Speech processing; Speech synthesis; Stress

fLanguage

English

Publisher

IEEE

Conference_Title

Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on

ISSN

1520-6149

Print_ISBN

0-7803-7663-3

Type

conf

DOI

10.1109/ICASSP.2003.1198879

Filename

1198879