Arabic Broadcast News Transcription Using a One Million Word Vocalized Vocabulary

Author

Messaoudi, A. ; Gauvain, Jean-Luc ; Lamel, Lori

Author_Institution

Spoken Language Processing Group, LIMSI-CNRS, Orsay

Volume

1

fYear

2006

fDate

14-19 May 2006

Abstract

Recently it has been shown that modeling short vowels in Arabic can significantly improve performance, even when producing a non-vocalized transcript. Since Arabic texts and audio transcripts are almost exclusively non-vocalized, the training methods have to overcome this missing-data problem. For the acoustic models, the procedure was bootstrapped with manually vocalized data and extended with semi-automatically vocalized data. To also capture the vowel information in the language model, a vocalized 4-gram language model trained on the audio transcripts was interpolated with the original 4-gram model trained on the (non-vocalized) written texts. Another challenge of the Arabic language is its large lexical variety. The out-of-vocabulary rate with a 65k-word vocabulary is in the range of 4-8% (compared to under 1% for English). To address this problem, a vocalized vocabulary containing over 1 million vocalized words, grouped into 200k word classes, is used. This reduces the out-of-vocabulary rate to about 2%. The extended vocabulary and vocalized language model trained on the manually annotated data give a 1.2% absolute word error reduction on the DARPA RT04 development data. However, including the automatically vocalized transcripts in the language model reduces performance, indicating that automatic vocalization needs to be improved.
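
The vocabulary grouping and out-of-vocabulary measurement described in the abstract can be illustrated with a minimal sketch. It assumes that a word class is the set of vocalized forms sharing one non-vocalized spelling, and that the non-vocalized form is obtained by stripping the Unicode diacritics U+064B to U+0652 (which includes shadda); the abstract does not specify either point. The function names `devocalize`, `build_word_classes`, and `oov_rate` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Arabic diacritics from fathatan (U+064B) through sukun (U+0652).
# Assumption: removing these yields the non-vocalized written form.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def devocalize(word):
    """Strip diacritics to obtain the non-vocalized form of a word."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def build_word_classes(vocalized_vocab):
    """Group vocalized words under their shared non-vocalized form."""
    classes = defaultdict(set)
    for word in vocalized_vocab:
        classes[devocalize(word)].add(word)
    return dict(classes)


def oov_rate(tokens, word_classes):
    """Fraction of non-vocalized running tokens not covered by any class."""
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in word_classes)
    return missing / len(tokens)


if __name__ == "__main__":
    # Two vocalizations of the same consonant skeleton, plus one other word.
    vocab = [
        "\u0643\u064E\u062A\u064E\u0628\u064E",
        "\u0643\u064F\u062A\u0650\u0628\u064E",
        "\u0639\u0650\u0644\u0652\u0645",
    ]
    classes = build_word_classes(vocab)
    # Non-vocalized test tokens; the third one has no class in the vocabulary.
    tokens = ["\u0643\u062A\u0628", "\u0639\u0644\u0645", "\u0628\u064A\u062A", "\u0643\u062A\u0628"]
    print(len(classes), oov_rate(tokens, classes))
```

In this toy setting three vocalized entries collapse into two classes, which mirrors on a small scale how over 1 million vocalized words could map to about 200k classes while coverage is measured on the non-vocalized text.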

Keywords

acoustics; natural languages; Arabic broadcast news transcription; acoustic models; audio transcripts; language model; manually vocalized data; semi-automatically vocalized data; word error reduction; word vocalized vocabulary; Automatic speech recognition; Broadcasting; Character recognition; Dictionaries; Error analysis; Natural languages; Speech analysis; Speech recognition; Training data; Vocabulary;

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on

Conference_Location

Toulouse

ISSN

1520-6149

Print_ISBN

1-4244-0469-X

Type

conf

DOI

10.1109/ICASSP.2006.1660215

Filename

1660215