Title :
Improved decision trees for multi-stream HMM-based audio-visual continuous speech recognition
Author :
Huang, Jing ; Visweswariah, Karthik
Author_Institution :
IBM T.J. Watson Res. Center, Yorktown Heights, NY, USA
Date :
Nov. 13 2009-Dec. 17 2009
Abstract :
HMM-based audio-visual speech recognition (AVSR) systems combine visual and audio information and have shown success in continuous speech recognition, especially in noisy environments. In this paper we study how to improve the decision trees used to create context classes in HMM-based AVSR systems. Traditionally, the visual models have been trained with the same context classes as the audio-only models. We investigate instead the use of separate decision trees to model the context classes of the audio and visual streams independently, and additionally the use of viseme classes in decision-tree building for the visual stream. In experiments on a 37-speaker, 1.5-hour test set (about 12000 words) of continuous digits in noise, we obtain about a 3% absolute (20% relative) gain in AVSR performance by using separate decision trees for the audio and visual streams, with viseme classes used in decision-tree building for the visual stream.
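The decision-tree context clustering the abstract refers to is typically a greedy top-down procedure: at each node, candidate yes/no questions about the phonetic (or, here, viseme-class) context are scored by the log-likelihood gain of splitting the data, and the best question is kept. A minimal sketch of one such split step, under the usual single-Gaussian surrogate objective (the function names, data layout, and 1-D features are illustrative assumptions, not the authors' implementation):

```python
import math

def gaussian_loglik(values):
    """Log-likelihood of 1-D samples under a single ML-fit Gaussian,
    the standard surrogate objective in phonetic decision-tree clustering."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    var = max(var, 1e-6)  # variance floor to avoid log(0) on constant data
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def best_split(samples, questions):
    """Pick the context question that maximizes the likelihood gain.

    samples:   list of (context_phone, feature_value) pairs pooled at a node
    questions: dict mapping question name -> set of contexts answering 'yes'
               (e.g. a viseme class such as the bilabials {p, b, m})
    Returns (gain, question_name, yes_samples, no_samples), or None if no
    question yields a valid split.
    """
    parent_ll = gaussian_loglik([v for _, v in samples])
    best = None
    for name, ctx_set in questions.items():
        yes = [(c, v) for c, v in samples if c in ctx_set]
        no = [(c, v) for c, v in samples if c not in ctx_set]
        if len(yes) < 2 or len(no) < 2:  # require a minimum occupancy
            continue
        gain = (gaussian_loglik([v for _, v in yes])
                + gaussian_loglik([v for _, v in no]) - parent_ll)
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    return best

# Toy example: bilabial left-contexts cluster low, others high, so the
# viseme-class question cleanly separates the data and wins the split.
samples = [('p', 0.10), ('b', 0.20), ('m', 0.15),
           ('s', 1.00), ('t', 1.10), ('k', 0.95)]
questions = {'bilabial': {'p', 'b', 'm'}, 'arbitrary': {'p', 's'}}
result = best_split(samples, questions)
```

Using viseme classes in the question set for the visual stream, as the paper proposes, amounts to restricting (or augmenting) `questions` with sets of phones that share the same visual appearance, so the visual tree can tie states along visually meaningful boundaries that an audio-derived tree would miss.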
Keywords :
audio-visual systems; decision trees; hidden Markov models; speech recognition; AVSR performance; HMM-based AVSR systems; decision tree building; decision trees; multistream HMM-based audio-visual continuous speech recognition; noisy environments; viseme classes; visual and audio information; visual stream; Acoustic noise; Automatic speech recognition; Context modeling; Decision trees; Decoding; Hidden Markov models; Performance gain; Signal to noise ratio; Speech recognition; Streaming media;
Conference_Title :
Automatic Speech Recognition & Understanding, 2009. ASRU 2009. IEEE Workshop on
Conference_Location :
Merano, Italy
Print_ISBN :
978-1-4244-5478-5
Electronic_ISBN :
978-1-4244-5479-2
DOI :
10.1109/ASRU.2009.5373454