Title :
Voice activity detection using visual information
Author :
Liu, Peng; Wang, Zuoying
Author_Institution :
Dept. of Electron. Eng., Tsinghua Univ., Beijing, China
Abstract :
In traditional voice activity detection (VAD) approaches, features of the audio stream, such as frame-energy features, are used for the voice/non-voice decision. In this paper, we present a general framework for a visual-information-based VAD approach in a multi-modal system. First, Gaussian mixture visual models of voice and non-voice are designed, and the decision rule is discussed in detail. Next, the visual feature extraction method for VAD is investigated, and the best visual feature structure and mixture number are selected experimentally. Our experiments show that visual-information-based VAD achieves a substantial reduction in frame error rate (31.1% relative) and segments the audio-visual stream into sentences for recognition much more precisely (98.4% relative reduction in sentence break error rate), compared with the frame-energy-based approach in the clean-audio case. Furthermore, the performance of visual-based VAD is independent of background noise.
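The per-frame decision rule the abstract describes, modeling voice and non-voice with Gaussian mixture models and comparing their likelihoods on each visual feature frame, can be sketched as follows. This is a minimal illustration with diagonal covariances and toy 1-D parameters; the function names, threshold, and parameters are assumptions for illustration, not the paper's actual models or visual features.

```python
import math

def gauss_logpdf(x, mean, var):
    # Log-density of x under a diagonal-covariance Gaussian N(mean, diag(var)).
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, variances):
    # log sum_k w_k * N(x; mu_k, var_k), computed stably via log-sum-exp.
    logs = [math.log(w) + gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def vad_decide(frame, voice_gmm, nonvoice_gmm, threshold=0.0):
    # Likelihood-ratio test: label the frame as voice when
    # log p(frame | voice) - log p(frame | non-voice) exceeds the threshold.
    llr = gmm_loglik(frame, *voice_gmm) - gmm_loglik(frame, *nonvoice_gmm)
    return llr > threshold

# Toy single-component "GMMs": voice centered at +1, non-voice at -1.
voice_gmm = ([1.0], [[1.0]], [[1.0]])       # (weights, means, variances)
nonvoice_gmm = ([1.0], [[-1.0]], [[1.0]])
```

In practice the mixture weights, means, and variances would be trained on labeled voice and non-voice visual feature frames (e.g. via EM), and the threshold tuned on held-out data.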
Keywords :
Gaussian distribution; error statistics; feature extraction; speech recognition; Gauss mixture visual models; VAD; audio-visual stream segmentation; background noise independence; decision rule; frame error rate; multi-modal system; performance; sentence break error rate; speech recognition; visual feature extraction; visual information; voice activity detection; Background noise; Crosstalk; Entropy; Error analysis; Feature extraction; Gaussian processes; Lips; Pattern recognition; Streaming media; Working environment noise;
Conference_Title :
2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), Proceedings
Print_ISBN :
0-7803-8484-9
DOI :
10.1109/ICASSP.2004.1326059