• DocumentCode
    72243
  • Title
    Audio-Visual Voice Activity Detection Using Diffusion Maps
  • Author
    Dov, David ; Talmon, Ronen ; Cohen, Israel
  • Author_Institution
    Dept. of Electr. Eng., Technion - Israel Inst. of Technol., Haifa, Israel
  • Volume
    23
  • Issue
    4
  • fYear
    2015
  • fDate
    Apr-15
  • Firstpage
    732
  • Lastpage
    745
  • Abstract
    The performance of traditional voice activity detectors deteriorates significantly in the presence of highly nonstationary noise and transient interferences. One solution is to incorporate a video signal, which is invariant to the acoustic environment. Although several voice activity detectors based on the video signal have recently been presented, only a few detectors based on both the audio and the video signals exist in the literature to date. In this paper, we present an audio-visual voice activity detector and show that incorporating both audio and video signals is highly beneficial for voice activity detection. The algorithm is based on a supervised learning procedure that assumes a labeled training data set. It comprises a feature extraction procedure in which the features are designed to separate speech from nonspeech frames. The diffusion maps method is applied separately, and in the same way, to the features of each modality to build a low-dimensional representation. Using the new representation, we propose a measure of voice activity that is based on a supervised learning procedure and on the variability between adjacent frames in time. The measures of the two modalities are merged to provide voice activity detection based on both the audio and the video signals. Experimental results demonstrate the improved performance of the proposed algorithm compared to state-of-the-art detectors.
  • Keywords
    audio signals; audio-visual systems; feature extraction; learning (artificial intelligence); speech recognition; video signals; acoustic environment; adjacent frames; audio signals; audio-visual voice activity detection; diffusion maps; feature extraction procedure; labeled training data set; low dimensional representation; nonstationary noise; supervised learning procedure; transient interferences; video signal; Detectors; Lips; Mouth; Noise; Speech; Speech processing; Transient analysis; Audio-visual speech processing; diffusion maps; voice activity detection;
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    2329-9290
  • Type
    jour
  • DOI
    10.1109/TASLP.2015.2405481
  • Filename
    7045572
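
The abstract states that diffusion maps are applied to the features of each modality to build a low-dimensional representation. The following is a minimal sketch of a generic diffusion maps embedding, not the authors' implementation. The function name `diffusion_map`, the Gaussian kernel, the median heuristic for the kernel scale `epsilon`, and the toy two-cluster data are all assumptions made for illustration; the record does not specify the paper's kernel, scale, or features.

```python
import numpy as np


def diffusion_map(features, n_components=2, epsilon=None, t=1):
    """Embed the rows of `features` (n_frames x n_dims) into diffusion coordinates.

    Hypothetical helper for illustration; parameter choices are assumptions.
    """
    X = np.asarray(features, dtype=float)

    # Pairwise squared Euclidean distances between frames.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative round-off

    if epsilon is None:
        # Assumed heuristic: median of the nonzero squared distances.
        epsilon = np.median(sq_dists[sq_dists > 0])

    # Gaussian affinity kernel and its row sums (degrees).
    K = np.exp(-sq_dists / epsilon)
    d = K.sum(axis=1)

    # Symmetric matrix S = D^{-1/2} K D^{-1/2}, which shares its eigenvalues
    # with the row-stochastic Markov matrix P = D^{-1} K.
    S = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(S)

    # Sort eigenpairs by decreasing eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    # Right eigenvectors of P are D^{-1/2} times the eigenvectors of S.
    psi = eigvecs / np.sqrt(d)[:, None]

    # Skip the trivial first eigenvector (eigenvalue 1, constant over frames)
    # and scale the remaining coordinates by eigenvalue**t.
    return psi[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t


if __name__ == "__main__":
    # Toy data standing in for per-frame features of one modality.
    rng = np.random.default_rng(0)
    cluster_a = rng.normal(0.0, 0.3, size=(20, 3))
    cluster_b = rng.normal(5.0, 0.3, size=(20, 3))
    embedding = diffusion_map(np.vstack([cluster_a, cluster_b]), epsilon=10.0)
    print(embedding.shape)
```

In a two-modality setting such as the one the abstract describes, a function like this would be called once on the audio features and once on the video features, and a voice activity measure would then be computed in each embedded space; how that measure is defined and merged is left to the paper itself.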