Title :
Audio-visual isolated digit recognition for whispered speech
Author :
Fan, Xing ; Busso, Carlos ; Hansen, John H. L.
Author_Institution :
Center for Robust Speech Syst. (CRSS), Univ. of Texas at Dallas, Richardson, TX, USA
fDate :
Aug. 29 2011-Sept. 2 2011
Abstract :
Whisper is used by talkers intentionally in certain circumstances to protect personal privacy. Due to the absence of periodic excitation in the production of whisper, there are considerable differences between neutral and whispered speech in the spectral structure. Therefore, the performance of speech recognition systems trained with high-energy voiced phonemes degrades significantly when tested with whisper. In this study, we investigate the use of multi-stream models in isolated digit recognition of whispered speech. A small digit corpus with one subject speaking both whispered and neutral speech is collected. The eigenlips approach is used to extract visual features describing the appearance of the lips. MFCCs are employed as the feature set for speech. Two HMM systems are trained, one for each stream independently, and their scores are linearly combined. The resulting word accuracy shows a significant improvement (37%, absolute). The study represents one of the first advancements in whisper recognition using audio-visual features. It also supports the use of multi-stream HMMs to improve performance under whisper/neutral speech conditions.
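The linear score combination mentioned in the abstract can be illustrated with a minimal sketch. It assumes each stream's HMM system yields one log-likelihood per digit and that the two stream weights sum to one; the abstract states neither, and the names `fuse_scores`, `recognize_digit`, and `audio_weight` are hypothetical, not taken from the paper.

```python
def fuse_scores(audio_scores, visual_scores, audio_weight=0.5):
    """Linearly combine per-digit scores from the audio and visual streams.

    Assumption: scores are log-likelihoods, one per digit, in the same order,
    and the visual weight is 1 - audio_weight.
    """
    if len(audio_scores) != len(visual_scores):
        raise ValueError("both streams must score the same set of digits")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return [
        audio_weight * a + visual_weight * v
        for a, v in zip(audio_scores, visual_scores)
    ]


def recognize_digit(audio_scores, visual_scores, audio_weight=0.5):
    """Return the index of the digit with the highest fused score."""
    fused = fuse_scores(audio_scores, visual_scores, audio_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

In a sketch like this, lowering `audio_weight` shifts the decision toward the visual stream, which is the plausible direction when the audio stream is degraded by whisper.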
Keywords :
audio-visual systems; cepstral analysis; feature extraction; hidden Markov models; speech recognition; HMM system; MFCC; audio-visual isolated digit recognition; eigenlip approach; energy voiced phonemes; multistream model; visual feature extraction; whisper recognition; whispered speech recognition system; Accuracy; Feature extraction; Hidden Markov models; Speech; Speech recognition; Vectors; Visualization;
Conference_Titel :
Signal Processing Conference, 2011 19th European
Conference_Location :
Barcelona