• DocumentCode
    672837
  • Title

    The development and analysis of a Malay broadcast news corpus

  • Author

    Tze Yuang Chong ; Xiong Xiao ; Haihua Xu ; Tien-Ping Tan ; Pham Chau-Khoa ; Dau-Cheng Lyu ; Eng Siong Chng ; Haizhou Li

  • Author_Institution
    Temasek Labs., Nanyang Technol. Univ., Singapore, Singapore
  • fYear
    2013
  • fDate
    25-27 Nov. 2013
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    This paper presents our effort in collecting a Malay broadcast news (BN) speech corpus to support our research in Malay LVCSR. The 53-hour corpus was recorded from TV channels in both Singapore and Malaysia over a 9-month period. To facilitate a range of LVCSR research, the corpus provides, besides the orthographic transcription, other metadata such as speaking environment type, speaker identity information, language identity, and topic descriptions. In the orthographic transcription, we also tagged various linguistic phenomena such as disfluencies, code-switched words, and proper nouns. We trained an ASR system and achieved a word error rate of 8.5% for anchor speech and 17.1% overall (including reporter and other speakers' speech) on 27 hours of test data.
  • Keywords
    audio databases; linguistics; meta data; natural languages; speech processing; speech recognition; ASR system; BN speech corpus; Malay LVCSR; Malay broadcast news speech corpus; Malaysia; Singapore; TV channels; anchor speech; code switched words; language identity; linguistic phenomena; metadata; orthographic transcription; reporter speech; speaker identity information; speaker speech; speaking environment type; topic descriptions; word error rate; Acoustics; Interviews; Noise; Speech; Speech recognition; Switches; TV; Malay; Speech corpus; broadcast news; speech recognition;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference
  • Conference_Location
    Gurgaon
  • Type

    conf

  • DOI
    10.1109/ICSDA.2013.6709862
  • Filename
    6709862