An automatic prosody labeling system using ANN-based syntactic-prosodic model and GMM-based acoustic-prosodic model

Author

Chen, Ken ; Hasegawa-Johnson, Mark ; Cohen, Aaron

Author_Institution

Dept. of Electr. & Comput. Eng., Illinois Univ., Urbana, IL, USA

Volume

1

fYear

2004

fDate

17-21 May 2004

Abstract

Automatic prosody labeling is important for both speech synthesis and automatic speech understanding. Humans use both syntactic and acoustic cues to predict the prosody of a given utterance. This process can be effectively modeled by an ANN-based syntactic-prosodic model that predicts prosody from syntax and a GMM-based acoustic-prosodic model that predicts prosody from acoustic-prosodic observations. Our experiments on the Radio News Corpus show that an ANN is effective in learning the stochastic mapping from the syntactic representation of word strings to prosody labels, with an accuracy of 82.7% for pitch accent labeling and 90.5% for intonational phrase boundary (IPB) labeling. When acoustic observations and reasonably accurate phoneme transcriptions are given, a GMM-based acoustic-prosodic model, coupled with the syntactic-prosodic model, can achieve 84% pitch accent recognition accuracy and 93% IPB recognition accuracy. These results were obtained using different speakers for training and testing, and they considerably exceed all previously reported results on the same corpus, especially for the task of IPB detection.
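The abstract does not say how the two models are coupled. As a minimal sketch only, one plausible reading is a Bayes-style product: the ANN supplies P(label | syntax), each label has its own GMM giving p(acoustics | label), and the two are multiplied and renormalized. This assumes the acoustic observation is conditionally independent of the syntax given the label. The function names, the one-dimensional acoustic feature, and all numeric values below are hypothetical and are not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_likelihood(x, components):
    """Likelihood of x under a 1-D GMM given as (weight, mean, variance) tuples."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def combine(syntactic_posterior, acoustic_gmms, x):
    """Combine P(label | syntax) with p(x | label) and renormalize.

    Assumption (not from the paper): x is conditionally independent of the
    syntax given the label, so the posterior is proportional to the product.
    """
    scores = {
        label: prior * gmm_likelihood(x, acoustic_gmms[label])
        for label, prior in syntactic_posterior.items()
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical ANN output for one word: syntax alone slightly favors "unaccented".
syntactic_posterior = {"accented": 0.4, "unaccented": 0.6}

# Hypothetical single-component GMMs over one normalized acoustic feature.
acoustic_gmms = {
    "accented": [(1.0, 1.0, 1.0)],
    "unaccented": [(1.0, -1.0, 1.0)],
}

posterior = combine(syntactic_posterior, acoustic_gmms, x=1.0)
best_label = max(posterior, key=posterior.get)
```

In this toy case the acoustic evidence overrides the syntactic prior, so `best_label` is `"accented"`. A real system would use multivariate features and GMMs trained per label, and the paper's actual coupling may differ from this product rule.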

Keywords

Gaussian distribution; neural nets; prediction theory; speech intelligibility; speech processing; speech recognition; speech synthesis; stochastic processes; ANN; GMM; Radio News Corpus; accurate phoneme transcriptions; acoustic observations; acoustic-prosodic model; automatic prosody labeling system; automatic speech understanding; intonational phrase boundary labeling; pitch accent labeling; pitch accent recognition accuracy; prediction; prosody labels; stochastic mapping learning; syntactic representation; syntactic-prosodic model; word strings; Acoustic signal detection; Acoustic testing; Data mining; Humans; Labeling; Loudspeakers; Predictive models; Text recognition

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on

ISSN

1520-6149

Print_ISBN

0-7803-8484-9

Type

conf

DOI

10.1109/ICASSP.2004.1326034

Filename

1326034