• DocumentCode
    177687
  • Title

    Audio-visual Keyword Spotting for Mandarin Based on Discriminative Local Spatial-Temporal Descriptors

  • Author

    Hong Liu ; Ting Fan ; Pingping Wu

  • Author_Institution
    Key Lab. of Machine Perception & Intell., Peking Univ., Shenzhen, China
  • fYear
    2014
  • fDate
    24-28 Aug. 2014
  • Firstpage
    785
  • Lastpage
    790
  • Abstract
    Although keyword spotting (KWS) technologies have been applied successfully in some applications, most KWS systems share a common weakness: poor noise robustness in real-world environments. Audio-visual keyword spotting (AVKWS), which uses both acoustic and visual information, addresses this problem by letting the two modalities complement each other. Most existing audio-visual speech recognition (AVSR) systems extract geometric features as visual features, which rely heavily on accurate and reliable detection and tracking of facial feature points. To avoid this drawback of geometric features, this paper proposes an appearance-based discriminative local spatial-temporal descriptor (disCLBP-TOP), which aims to extract robust and discriminative patterns of interest. In addition, a parallel two-step recognition scheme based on both acoustic and visual keyword searching and re-scoring is used, so that the two modalities complement each other under different noise conditions. Adaptive weights for decision fusion are generated by a sigmoid function of the reliabilities of the two modalities, which allows the fusion to adapt to various noise conditions. Experiments show that the proposed parallel AVKWS strategy based on decision fusion significantly improves noise robustness and outperforms a feature-fusion-based audio-visual spotter. Additionally, disCLBP-TOP shows more competitive performance than CLBP-TOP.
  • Keywords
    face recognition; feature extraction; sensor fusion; speech recognition; AVKWS; AVSR systems; KWS technologies; Mandarin; appearance-based discriminative local spatial-temporal descriptor; audio-visual keyword spotting; audio-visual speech recognition; decision fusion; disCLBP-TOP; facial feature points detection; facial feature points tracking; geometric feature extraction; parallel two-step recognition; sigmoid function; visual keyword searching; Acoustics; Feature extraction; Hidden Markov models; Noise; Noise measurement; Reliability; Visualization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition (ICPR), 2014 22nd International Conference on
  • Conference_Location
    Stockholm
  • ISSN
    1051-4651
  • Type

    conf

  • DOI
    10.1109/ICPR.2014.145
  • Filename
    6976855