System for automatic collection, annotation and indexing of Czech broadcast speech with full-text search

Author

Nouza, Jan; Zdansky, Jindrich; Cerva, Petr

Author_Institution

Fac. of Mechatron., Tech. Univ. of Liberec, Liberec, Czech Republic

fYear

2010

fDate

26-28 April 2010

Firstpage

202

Lastpage

205

Abstract

This paper describes a complex system we developed for the automatic acquisition of a large corpus of spoken Czech. The system continuously monitors a selected Czech TV station and automatically transcribes its audio track. Transcription is performed by our own speech recognition engine, which employs a vocabulary of the 350 thousand most frequent Czech words (and word-forms). Transcription accuracy is fairly good for studio speech (above 90 per cent) but may drop significantly for noisy recordings and spontaneous speech. Nevertheless, the system runs without any human supervision, and during its operation in 2007 it collected, transcribed, stored and indexed more than 1800 hours of Czech spoken documents. Any word or word combination in this corpus can be searched with a full-text search engine accessible over the Internet.

Keywords

indexing; natural languages; speech recognition equipment; Czech broadcast speech indexing; automatic collection; automatic transcription; full-text search; speech recognition engine; Audio recording; Humans; Indexing; Internet; Radio broadcasting; Search engines; Signal processing; Speech processing; Speech recognition; TV broadcasting;

fLanguage

English

Publisher

ieee

Conference_Titel

MELECON 2010 - 2010 15th IEEE Mediterranean Electrotechnical Conference

Conference_Location

Valletta

Print_ISBN

978-1-4244-5793-9

Type

conf

DOI

10.1109/MELCON.2010.5476306

Filename

5476306