Title :
LOTUS-BN: A Thai broadcast news corpus and its research applications
Author :
Chotimongkol, Ananlada ; Saykhum, Kwanchiva ; Chootrakool, Patcharika ; Thatphithakkul, Nattanun ; Wutiwiwatchai, Chai
Author_Institution :
Nat. Electron. & Comput. Technol. Center (NECTEC), Pathumthani, Thailand
Abstract :
This paper describes the design and construction of the LOTUS-BN corpus, a Thai television broadcast news corpus. In addition to audio recordings and their transcriptions, the corpus includes detailed annotation of many interesting characteristics of broadcast news data, such as acoustic condition, overlapping speech, news topic, and named entities. LOTUS-BN is an ongoing project with the goal of collecting 100 hours of speech. We report initial statistics from 60 hours of speech, which show that the LOTUS-BN corpus has a rich vocabulary of approximately 26,000 words, one third of which are named entities. The corpus is thus a good resource for developing an LVCSR system and for investigating named entity detection and recognition, in addition to broadcast news related applications. Research applications on these topics are also discussed.
Keywords :
broadcasting; multimedia databases; speech recognition; LOTUS-BN corpus; Thailand; acoustic condition; broadcast news data; named entity recognition; news topic; overlapping speech; television broadcast news corpus; Application software; Audio recording; Broadcast technology; Loudspeakers; Multimedia communication; Speech processing; Speech recognition; Statistical analysis; TV broadcasting; Vocabulary;
Conference_Title :
Speech Database and Assessments, 2009 Oriental COCOSDA International Conference on
Conference_Location :
Urumqi
Print_ISBN :
978-1-4244-4400-7
Electronic_ISBN :
978-1-4244-4400-7
DOI :
10.1109/ICSDA.2009.5278377