DocumentCode
2525587
Title
System for automatic collection, annotation and indexing of Czech broadcast speech with full-text search
Author
Nouza, Jan ; Zdansky, Jindrich ; Cerva, Petr
Author_Institution
Fac. of Mechatron., Tech. Univ. of Liberec, Liberec, Czech Republic
fYear
2010
fDate
26-28 April 2010
Firstpage
202
Lastpage
205
Abstract
In this paper we describe a complex system that we developed for the automatic acquisition of a large corpus of spoken Czech. The system continuously monitors a selected Czech TV station and automatically transcribes its audio track. The transcription is performed by our own speech recognition engine, which employs a vocabulary of the 350 thousand most frequent Czech words (and word-forms). Transcription accuracy is fairly good for studio speech (above 90 per cent), but may drop significantly for noisy recordings and spontaneous speech. Nevertheless, the system runs without any human supervision, and during its operation in 2007 it collected, transcribed, stored and indexed more than 1800 hours of Czech spoken documents. Any word or word combination in this corpus can be searched with a full-text search engine accessible over the Internet.
Keywords
indexing; natural languages; speech recognition equipment; Czech broadcast speech indexing; automatic collection; automatic transcription; full-text search; speech recognition engine; Audio recording; Humans; Indexing; Internet; Radio broadcasting; Search engines; Signal processing; Speech processing; Speech recognition; TV broadcasting
fLanguage
English
Publisher
IEEE
Conference_Titel
MELECON 2010 - 2010 15th IEEE Mediterranean Electrotechnical Conference
Conference_Location
Valletta
Print_ISBN
978-1-4244-5793-9
Type
conf
DOI
10.1109/MELCON.2010.5476306
Filename
5476306