• DocumentCode
    2525587
  • Title
    System for automatic collection, annotation and indexing of Czech broadcast speech with full-text search
  • Author
    Nouza, Jan ; Zdansky, Jindrich ; Cerva, Petr
  • Author_Institution
    Fac. of Mechatron., Tech. Univ. of Liberec, Liberec, Czech Republic
  • fYear
    2010
  • fDate
    26-28 April 2010
  • Firstpage
    202
  • Lastpage
    205
  • Abstract
    This paper describes a complex system we developed for the automatic acquisition of a large corpus of spoken Czech. The system continuously monitors a selected Czech TV station and automatically transcribes its audio track. The transcription is performed by our own speech recognition engine, which employs a vocabulary of the 350 thousand most frequent Czech words (and word-forms). Transcription accuracy is fairly good for studio speech (above 90 per cent) but may drop significantly for noisy recordings and spontaneous speech. Nevertheless, the system runs without any human supervision, and during its operation in 2007 it collected, transcribed, stored and indexed more than 1800 hours of Czech spoken documents. Any word or word combination in this corpus can be easily searched with a full-text search engine accessible over the Internet.
  • Keywords
    indexing; natural languages; speech recognition equipment; Czech broadcast speech indexing; automatic collection; automatic transcription; full-text search; speech recognition engine; Audio recording; Humans; Indexing; Internet; Radio broadcasting; Search engines; Signal processing; Speech processing; Speech recognition; TV broadcasting
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    MELECON 2010 - 2010 15th IEEE Mediterranean Electrotechnical Conference
  • Conference_Location
    Valletta
  • Print_ISBN
    978-1-4244-5793-9
  • Type
    conf
  • DOI
    10.1109/MELCON.2010.5476306
  • Filename
    5476306