Regional Information Center for Science and Technology (RICeST) - Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs

DocumentCode :

1762187

Title :

Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs

Author :

Minotto, Vicente P. ; Jung, Claudio R. ; Bowon Lee

Author_Institution :

Inst. of Inf., Fed. Univ. of Rio Grande do Sul, Porto Alegre, Brazil

Volume :

Issue :

fYear :

2014

fDate :

June 2014

Firstpage :

1032

Lastpage :

1044

Abstract :

Humans can extract speech signals that they need to understand from a mixture of background noise, interfering sound sources, and reverberation for effective communication. Voice Activity Detection (VAD) and Sound Source Localization (SSL) are the key signal processing components that humans perform by processing sound signals received at both ears, sometimes with the help of visual cues by locating and observing the lip movements of the speaker. Both VAD and SSL serve as the crucial design elements for building applications involving human speech. For example, systems with microphone arrays can benefit from these for robust speech capture in video conferencing applications, or for speaker identification and speech recognition in Human Computer Interfaces (HCIs). The design and implementation of robust VAD and SSL algorithms in practical acoustic environments are still challenging problems, particularly when multiple simultaneous speakers exist in the same audiovisual scene. In this work, we propose a multimodal approach that uses Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) for assessing the video and audio modalities through an RGB camera and a microphone array. By analyzing the individual speakers' spatio-temporal activities and mouth movements, we propose a mid-fusion approach to perform both VAD and SSL for multiple active and inactive speakers. We tested the proposed algorithm in scenarios with up to three simultaneous speakers, showing an average VAD accuracy of 95.06% with an average error of 10.9 cm when estimating the three-dimensional locations of the speakers.

Keywords :

audio signal processing; audio-visual systems; cameras; hidden Markov models; human computer interaction; image colour analysis; interference (signal); microphone arrays; spatiotemporal phenomena; speaker recognition; support vector machines; HCI; HMM; RGB camera; SVM midfusion; acoustic environments; audio modalities; human computer interfaces; interfering sound sources; lip movements; microphone array; mid-fusion approach; mouth movements; robust SSL algorithms; robust VAD algorithms; robust speech capture; signal processing components; simultaneous-speaker voice activity detection; simultaneous-speaker voice activity localization; sound signal processing; sound source localization; speaker identification; speaker spatio-temporal activities; speech recognition; speech signal extraction; video conferencing applications; video modalities; Accuracy; Array signal processing; Speech; Visualization; Beamforming; SRP-PHAT; hidden Markov model; multimodal fusion; optical-flow; support vector machine; voice activity detection

fLanguage :

English

Journal_Title :

Multimedia, IEEE Transactions on

Publisher :

IEEE

ISSN :

1520-9210

Type :

jour

DOI :

10.1109/TMM.2014.2305632

Filename :

6737222

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1762187