Title :
Voice activity detection using visual information
Author :
Liu, Peng; Wang, Zuoying
Author_Institution :
Dept. of Electron. Eng., Tsinghua Univ., Beijing, China
Abstract :
In traditional voice activity detection (VAD) approaches, features of the audio stream, such as frame-energy features, are used for the voice/non-voice decision. In this paper, we present a general framework for a visual-information-based VAD approach in a multi-modal system. First, Gaussian mixture visual models of voice and non-voice are designed, and the decision rule is discussed in detail. Next, the visual feature extraction method for VAD is investigated, and the best visual feature structure and mixture number are selected experimentally. Our experiments show that visual-information-based VAD achieves a substantial reduction in frame error rate (31.1% relative) and segments the audio-visual stream into sentences for recognition much more precisely (98.4% relative reduction in sentence break error rate), compared with the frame-energy-based approach in the clean-audio case. Furthermore, the performance of visual-based VAD is independent of background noise.
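The per-frame decision rule the abstract describes, modeling voice and non-voice with Gaussian mixture models and comparing their likelihoods on each visual feature frame, can be sketched as follows. This is a minimal illustration with diagonal covariances and toy 1-D parameters; the function names, threshold, and parameters are assumptions for illustration, not the paper's actual models or visual features.

```python
import math

def gauss_logpdf(x, mean, var):
    # Log-density of x under a diagonal-covariance Gaussian N(mean, diag(var)).
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, variances):
    # log sum_k w_k * N(x; mu_k, var_k), computed stably via log-sum-exp.
    logs = [math.log(w) + gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def vad_decide(frame, voice_gmm, nonvoice_gmm, threshold=0.0):
    # Likelihood-ratio test: label the frame as voice when
    # log p(frame | voice) - log p(frame | non-voice) exceeds the threshold.
    llr = gmm_loglik(frame, *voice_gmm) - gmm_loglik(frame, *nonvoice_gmm)
    return llr > threshold

# Toy single-component "GMMs": voice centered at +1, non-voice at -1.
voice_gmm = ([1.0], [[1.0]], [[1.0]])       # (weights, means, variances)
nonvoice_gmm = ([1.0], [[-1.0]], [[1.0]])
```

In practice the mixture weights, means, and variances would be trained on labeled voice and non-voice visual feature frames (e.g. via EM), and the threshold tuned on held-out data.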
Keywords :
Gaussian distribution; error statistics; feature extraction; speech recognition; Gauss mixture visual models; VAD; audio-visual stream segmentation; background noise independence; decision rule; frame error rate; multi-modal system; performance; sentence break error rate; speech recognition; visual feature extraction; visual information; voice activity detection; Background noise; Crosstalk; Entropy; Error analysis; Feature extraction; Gaussian processes; Lips; Pattern recognition; Streaming media; Working environment noise;
Conference_Title :
2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), Proceedings
Print_ISBN :
0-7803-8484-9
DOI :
10.1109/ICASSP.2004.1326059