Title :
Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs
Author :
Minotto, Vicente P. ; Jung, Claudio R. ; Lee, Bowon
Author_Institution :
Inst. of Inf., Fed. Univ. of Rio Grande do Sul, Porto Alegre, Brazil
Abstract :
Humans can extract speech signals that they need to understand from a mixture of background noise, interfering sound sources, and reverberation for effective communication. Voice Activity Detection (VAD) and Sound Source Localization (SSL) are the key signal processing components that humans perform by processing sound signals received at both ears, sometimes with the help of visual cues by locating and observing the lip movements of the speaker. Both VAD and SSL serve as the crucial design elements for building applications involving human speech. For example, systems with microphone arrays can benefit from these for robust speech capture in video conferencing applications, or for speaker identification and speech recognition in Human Computer Interfaces (HCIs). The design and implementation of robust VAD and SSL algorithms in practical acoustic environments are still challenging problems, particularly when multiple simultaneous speakers exist in the same audiovisual scene. In this work we propose a multimodal approach that uses Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) for assessing the video and audio modalities through an RGB camera and a microphone array. By analyzing the individual speakers' spatio-temporal activities and mouth movements, we propose a mid-fusion approach to perform both VAD and SSL for multiple active and inactive speakers. We tested the proposed algorithm in scenarios with up to three simultaneous speakers, showing an average VAD accuracy of 95.06% with an average error of 10.9 cm when estimating the three-dimensional locations of the speakers.
Keywords :
audio signal processing; audio-visual systems; cameras; hidden Markov models; human computer interaction; image colour analysis; interference (signal); microphone arrays; spatiotemporal phenomena; speaker recognition; support vector machines; HCI; HMM; RGB camera; SVM midfusion; acoustic environments; audio modalities; human computer interfaces; interfering sound sources; lip movements; microphone array; mid-fusion approach; mouth movements; robust SSL algorithms; robust VAD algorithms; robust speech capture; signal processing components; simultaneous-speaker voice activity detection; simultaneous-speaker voice activity localization; sound signal processing; sound source localization; speaker identification; speaker spatio-temporal activities; speech recognition; speech signal extraction; video conferencing applications; video modalities; Accuracy; Array signal processing; Human computer interaction; Microphone arrays; Speech; Visualization; Beamforming; SRP-PHAT; hidden Markov model; multimodal fusion; optical-flow; support vector machine; voice activity detection
Journal_Title :
Multimedia, IEEE Transactions on
DOI :
10.1109/TMM.2014.2305632