A New Prosody-Assisted Mandarin ASR System

Author

Chen, Sin-Horng ; Yang, Jyh-Her ; Chiang, Chen-Yu ; Liu, Ming-Chieh ; Wang, Yih-Ru

Author_Institution

Dept. of Electr. Eng., Nat. Chiao Tung Univ., Hsinchu, Taiwan

Volume

20

Issue

6

fYear

2012

Firstpage

1669

Lastpage

1684

Abstract

This paper presents a new prosody-assisted automatic speech recognition (ASR) system for Mandarin speech. It differs from the conventional approach of using simple prosodic cues in employing a sophisticated prosody modeling approach: a four-layer prosody-hierarchy structure is used to automatically generate 12 prosodic models from a large unlabeled speech database by means of the previously proposed joint prosody labeling and modeling (PLM) algorithm. These 12 prosodic models are incorporated into a two-stage ASR system, where they rescore the word lattice generated in the first stage by a conventional hidden Markov model (HMM) recognizer, yielding a better recognized word string. In addition, other information can be decoded, including part of speech (POS), punctuation marks (PM), and two types of prosodic tags that can be used to construct the prosody-hierarchy structure of the test speech. Experimental results on the TCC300 database, which consists of long paragraphic utterances, showed that the proposed system significantly outperformed the baseline scheme, an HMM recognizer with a factored language model that models word, POS, and PM. The proposed system achieved word, character, and base-syllable error rates of 20.7%, 14.4%, and 9.6%, corresponding to absolute error reductions of 3.7%, 3.7%, and 2.4% (or relative error reductions of 15.2%, 20.4%, and 20%). An error analysis showed that many word segmentation errors and tone recognition errors were corrected.
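The relation between the absolute and relative error reductions quoted in the abstract can be checked with a short calculation. The sketch below assumes that each baseline error rate equals the reported error rate plus the reported absolute reduction; the abstract does not state the baseline rates directly, so that is an inference, and the names `reported` and `relative_reduction` are illustrative only.

```python
# Error rates (%) reported in the abstract for the proposed system,
# paired with the absolute reductions (%) reported over the baseline.
reported = {
    "word": (20.7, 3.7),
    "character": (14.4, 3.7),
    "base-syllable": (9.6, 2.4),
}


def relative_reduction(final_rate, absolute_reduction):
    """Relative error reduction (%) with respect to the implied baseline."""
    # Assumption: baseline rate = reported rate + absolute reduction.
    baseline = final_rate + absolute_reduction
    return 100.0 * absolute_reduction / baseline


# Round to one decimal place, matching the precision used in the abstract.
results = {
    unit: round(relative_reduction(final, absolute), 1)
    for unit, (final, absolute) in reported.items()
}

for unit, rel in results.items():
    print(f"{unit}: {rel}% relative error reduction")
```

Under this assumption the implied baseline error rates are 24.4%, 18.1%, and 12.0%, and the computed relative reductions agree with the figures given in the abstract.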

Keywords

error analysis; hidden Markov models; speech recognition; HMM recognizer; Mandarin speech; TCC300 database; automatic speech recognition system; hidden Markov model; part of speech; prosody labeling and modeling algorithm; prosody-assisted Mandarin ASR system; prosody-hierarchy structure; punctuation mark; tone recognition errors; word segmentation errors; Acoustics; Databases; Labeling; Pragmatics; Speech; Prosody modeling; prosody-assisted automatic speech recognition (ASR)

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE Transactions on

Publisher

IEEE

ISSN

1558-7916

Type

jour

DOI

10.1109/TASL.2012.2187192

Filename

6148262