Title :
Designing a multimodal corpus of audio-visual speech using a high-speed camera
Author :
Karpov, Aleksey ; Ronzhin, Anatoly ; Kipyatkova, Irina
Author_Institution :
Speech & Multimodal Interfaces Lab., St. Petersburg Inst. for Inf. & Autom., St. Petersburg, Russia
Abstract :
In this paper, we present research on designing and processing an audio-visual speech database for an automatic Russian speech recognition system, recorded with an Oktava MK-012 microphone and a JAI Pulnix RMC-6740GE high-speed camera (200 frames per second). We describe the audio-visual speech recording system we developed, which synchronizes and fuses audio and video data recorded by independent sensors. The system automatically detects voice activity in the audio signal and stores only speech fragments, discarding non-informative signals. It also takes into account and processes the natural asynchrony of the two speech modalities. We present methods for extracting acoustic features (based on Mel-frequency cepstral coefficients) and visual speech features (pixel-based features of the mouth region), and for temporal segmentation of the multimodal data (by forced alignment).
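The abstract does not say how voice activity is detected or how audio time is mapped onto video frames, so the following is only a minimal sketch under stated assumptions. It assumes a simple frame-energy threshold for voice activity detection (the paper's actual method may differ) and a shared clock between the two sensors. The function names, the 10 ms frame length, and the threshold value are hypothetical; only the 200 frames per second rate comes from the abstract.

```python
import math

VIDEO_FPS = 200  # camera frame rate stated in the abstract


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping audio frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_segments(samples, sample_rate, frame_ms=10, threshold=0.01):
    """Return (start_sec, end_sec) spans whose frame energy exceeds threshold.

    Assumption: an energy threshold stands in for the unspecified detector.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies(samples, frame_len)
    segments, start = [], None
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx  # speech begins at this frame
        elif energy <= threshold and start is not None:
            segments.append((start * frame_len / sample_rate,
                             idx * frame_len / sample_rate))
            start = None
    if start is not None:  # speech runs to the end of the recording
        segments.append((start * frame_len / sample_rate,
                         len(energies) * frame_len / sample_rate))
    return segments


def video_frame_range(segment, fps=VIDEO_FPS):
    """Map an audio time span to the covering range of video frame indices."""
    start_sec, end_sec = segment
    return math.floor(start_sec * fps), math.ceil(end_sec * fps)
```

Given a recording sampled at an assumed 16 kHz, `speech_segments(samples, 16000)` yields the speech spans to keep, and `video_frame_range` gives the video frames to store alongside each span. A real system would also need to handle clock drift between the sensors and the audio-visual asynchrony the abstract mentions, neither of which this sketch models.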
Keywords :
audio databases; image sensors; speech recognition; JAI Pulnix RMC-6740GE; Mel frequency cepstral coefficients; Oktava MK-012 microphone; Russian speech recognition system; audio signal; audio visual speech database; high speed camera; independent sensors; informative signals; mouth region; multimodal corpus design; multimodal data temporal segmentation; pixel based features; speech fragments; voice activity; audio-visual speech; automatic speech recognition; computer vision; high-speed camera; multimodal system;
Conference_Title :
Signal Processing (ICSP), 2012 IEEE 11th International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4673-2196-9
DOI :
10.1109/ICoSP.2012.6491539