• DocumentCode
    454721
  • Title
    Arabic Broadcast News Transcription Using a One Million Word Vocalized Vocabulary
  • Author
    Messaoudi, A. ; Gauvain, Jean-Luc ; Lamel, Lori
  • Author_Institution
    Spoken Language Processing Group, LIMSI-CNRS, Orsay
  • Volume
    1
  • fYear
    2006
  • fDate
    14-19 May 2006
  • Abstract
    Recently it has been shown that modeling short vowels in Arabic can significantly improve performance even when producing a non-vocalized transcript. Since Arabic texts and audio transcripts are almost exclusively non-vocalized, the training methods have to overcome this missing-data problem. For the acoustic models, the procedure was bootstrapped with manually vocalized data and extended with semi-automatically vocalized data. To also capture the vowel information in the language model, a vocalized 4-gram language model trained on the audio transcripts was interpolated with the original 4-gram model trained on the (non-vocalized) written texts. Another challenge of the Arabic language is its large lexical variety. The out-of-vocabulary rate with a 65k word vocabulary is in the range of 4-8% (compared to under 1% for English). To address this problem, a vocalized vocabulary containing over 1 million vocalized words, grouped into 200k word classes, is used. This reduces the out-of-vocabulary rate to about 2%. The extended vocabulary and the vocalized language model trained on the manually annotated data give a 1.2% absolute word error reduction on the DARPA RT04 development data. However, including the automatically vocalized transcripts in the language model reduces performance, indicating that automatic vocalization needs to be improved.
  • Keywords
    acoustics; natural languages; Arabic broadcast news transcription; acoustic models; audio transcripts; language model; manually vocalized data; semi-automatically vocalized data; word error reduction; word vocalized vocabulary; Automatic speech recognition; Broadcasting; Character recognition; Dictionaries; Error analysis; Natural languages; Speech analysis; Speech recognition; Training data; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on
  • Conference_Location
    Toulouse
  • ISSN
    1520-6149
  • Print_ISBN
    1-4244-0469-X
  • Type
    conf
  • DOI
    10.1109/ICASSP.2006.1660215
  • Filename
    1660215