Title :
LipActs: Efficient representations for visual speakers
Author_Institution :
AT&T Labs Research, Middletown, NJ, USA
Abstract :
Video-based lip activity analysis has been used successfully to assist speech recognition for almost a decade. Surprisingly, the same capability has not been widely applied to near real-time visual speaker retrieval and verification, owing to tracking complexity, inadequate or difficult feature determination, and the need for large amounts of pre-labeled data for model training. This paper explores the performance of several solutions built on modern histogram of oriented gradients (HOG) features and several quantization techniques, and analyzes the benefits of temporal sampling and spatial partitioning to derive a representation called LipActs. Two datasets are used for evaluation: one with 81 participants drawn from varying-quality YouTube content, and one with 3 participants captured by a forward-facing mobile video camera in 10 varied lighting and capture-angle environments. Over these datasets, LipActs with a moderate number of pooled temporal frames and multi-resolution spatial quantization offer an improvement of 37-73% over raw features when optimizing for the lowest equal error rate (EER).
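To make the kind of representation the abstract describes more concrete, the sketch below pools HOG descriptors over a short temporal window of cropped mouth-region frames and concatenates features computed at two spatial resolutions. This is an illustrative approximation, not the authors' exact pipeline: all parameter values (orientation bins, cell sizes, window length) and the averaging-based pooling are assumptions.

```python
# Minimal sketch of a LipActs-style descriptor, assuming cropped,
# equal-size grayscale mouth-region frames as input. All parameters
# below are illustrative assumptions, not values from the paper.
import numpy as np
from skimage.feature import hog

def lip_descriptor(mouth_frames, pool_size=5):
    """Pool HOG features over a short temporal window of frames."""
    feats = []
    for frame in mouth_frames[:pool_size]:
        # Multi-resolution spatial quantization: concatenate HOG
        # features from a coarse and a fine cell grid.
        coarse = hog(frame, orientations=9,
                     pixels_per_cell=(16, 16), cells_per_block=(1, 1))
        fine = hog(frame, orientations=9,
                   pixels_per_cell=(8, 8), cells_per_block=(1, 1))
        feats.append(np.concatenate([coarse, fine]))
    # Temporal pooling: average the per-frame descriptors.
    return np.mean(feats, axis=0)
```

Descriptors produced this way could then be compared (e.g., by cosine or Euclidean distance) for retrieval or verification against enrolled speakers.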
Keywords :
computational complexity; speech recognition; video signal processing; EER; HOG; LipActs; YouTube content; equal error rate; histogram of oriented gradients; mobile video camera; quantization techniques; spatial partitioning; temporal sampling; tracking complexity; video-based lip activity analysis; visual speaker retrieval; Detectors; Face; Feature extraction; Histograms; Quantization; Visualization; Vocabulary; learning systems; verification; video analysis;
Conference_Titel :
2011 IEEE International Conference on Multimedia and Expo (ICME)
Conference_Location :
Barcelona, Spain
Print_ISBN :
978-1-61284-348-3
Electronic_ISSN :
1945-7871
DOI :
10.1109/ICME.2011.6012102