Title :
System for automatic collection, annotation and indexing of Czech broadcast speech with full-text search
Author :
Nouza, Jan ; Zdansky, Jindrich ; Cerva, Petr
Author_Institution :
Fac. of Mechatron., Tech. Univ. of Liberec, Liberec, Czech Republic
Abstract :
In this paper we describe a complex system that we developed for the automatic acquisition of a large corpus of spoken Czech. The system continuously monitors a selected Czech TV station and automatically transcribes its audio track. The transcription is performed by our own speech recognition engine, which employs a vocabulary of the 350 thousand most frequent Czech words (and word-forms). Transcription accuracy is above 90 per cent for studio speech, but may drop significantly for noisy recordings and spontaneous speech. The system runs without any human supervision, and during its operation in 2007 it collected, transcribed, stored and indexed more than 1800 hours of Czech spoken documents. Any word or word combination in this corpus can be found with a full-text search engine accessible over the Internet.
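The abstract does not say how the transcripts are indexed or searched. As a purely illustrative sketch, and not the authors' implementation, searching for a word or word combination over transcribed documents could be done with a positional inverted index. The class `TranscriptIndex`, its methods, the document ids and the sample sentences below are all hypothetical.

```python
import re
from collections import defaultdict


class TranscriptIndex:
    """Toy positional inverted index over transcripts (hypothetical sketch)."""

    def __init__(self):
        # word -> {doc_id -> [positions of the word in that document]}
        self.postings = defaultdict(lambda: defaultdict(list))

    @staticmethod
    def tokenize(text):
        # Lowercase and split on non-word characters; \w is Unicode-aware
        # for str patterns in Python 3, so Czech diacritics are preserved.
        return re.findall(r"\w+", text.lower())

    def add(self, doc_id, transcript):
        # Record every position at which each word occurs in the document.
        for pos, word in enumerate(self.tokenize(transcript)):
            self.postings[word][doc_id].append(pos)

    def search(self, query):
        """Return sorted ids of documents that contain the query words
        consecutively and in the given order."""
        words = self.tokenize(query)
        if not words or any(w not in self.postings for w in words):
            return []
        hits = []
        for doc_id, starts in self.postings[words[0]].items():
            for start in starts:
                # The i-th query word must sit exactly i positions after start.
                if all(start + i in self.postings[w].get(doc_id, ())
                       for i, w in enumerate(words)):
                    hits.append(doc_id)
                    break
        return sorted(hits)


# Usage with made-up transcripts (assumed data, not taken from the corpus).
index = TranscriptIndex()
index.add("news-01", "Dobrý večer, vítejte u večerních zpráv")
index.add("news-02", "Zprávy o počasí: večer bude jasno")
print(index.search("večer"))
print(index.search("dobrý večer"))
```

A positional index is used here because it supports both single-word queries and exact word combinations. A real system working on recognizer output would probably also need to handle recognition errors and Czech inflection, which this sketch ignores.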
Keywords :
indexing; natural languages; speech recognition equipment; Czech broadcast speech indexing; automatic collection; automatic transcription; full-text search; speech recognition engine; Audio recording; Humans; Indexing; Internet; Radio broadcasting; Search engines; Signal processing; Speech processing; Speech recognition; TV broadcasting;
Conference_Title :
MELECON 2010 - 2010 15th IEEE Mediterranean Electrotechnical Conference
Conference_Location :
Valletta
Print_ISBN :
978-1-4244-5793-9
DOI :
10.1109/MELCON.2010.5476306