• DocumentCode
    3078945
  • Title

    Automatic speech recognition improved by two-layered audio-visual integration for robot audition

  • Author

    Yoshida, Takami ; Nakadai, Kazuhiro ; Okuno, Hiroshi G.

  • Author_Institution
    Mech. & Environ. Inf., Tokyo Inst. of Technol., Tokyo, Japan
  • fYear
    2009
  • fDate
    7-10 Dec. 2009
  • Firstpage
    604
  • Lastpage
    609
  • Abstract
    Robot audition requires robust, high-performance automatic speech recognition (ASR), because people usually communicate with each other by speaking. This paper presents a two-layered audio-visual integration that makes ASR more robust against the speaker's distance, interfering talkers, and environmental noise. It consists of Audio-Visual Voice Activity Detection (AV-VAD) and Audio-Visual Speech Recognition (AVSR). Because VAD performance strongly affects ASR performance, the AV-VAD layer integrates several AV features using a Bayesian network to robustly detect voice activity, that is, the duration of the speaker's utterance. The AVSR layer integrates reliability estimates of acoustic and visual features using a missing-feature theory method. Audio features are given more weight in a clean acoustic environment, while visual features are given more weight in a noisy one. This AVSR-layer integration can cope with dynamically changing acoustic or visual environments. The proposed AV-integrated ASR is implemented on HARK, our open-source robot audition software, with an 8-channel microphone array. Empirical results show that our system improves ASR results by 9.9 and 16.7 points with and without microphone array processing, respectively, and also improves robustness under several auditory/visual noise conditions.
  • Keywords
    acoustic signal detection; array signal processing; audio-visual systems; belief networks; feature extraction; hearing; public domain software; robots; speaker recognition; AVSR layer integration; Bayesian network; HARK; acoustic features reliability estimation; audio-visual speech recognition; audio-visual voice activity detection; automatic speech recognition; microphone array processing; missing-feature theory method; open-sourced robot audition software; robot audition; two-layered audio-visual integration; voice activity detection; Acoustic noise; Acoustic signal detection; Automatic speech recognition; Bayesian methods; Microphone arrays; Noise robustness; Reliability theory; Robotics and automation; Speech recognition; Working environment noise
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Humanoid Robots, 2009. Humanoids 2009. 9th IEEE-RAS International Conference on
  • Conference_Location
    Paris
  • Print_ISBN
    978-1-4244-4597-4
  • Electronic_ISBN
    978-1-4244-4588-2
  • Type

    conf

  • DOI
    10.1109/ICHR.2009.5379586
  • Filename
    5379586