Title :
Learning Dynamic Stream Weights For Coupled-HMM-Based Audio-Visual Speech Recognition
Author :
Abdelaziz, Ahmed Hussen ; Zeiler, Steffen ; Kolossa, Dorothea
Author_Institution :
Cognitive Signal Processing Group, Ruhr-Universität Bochum, Bochum, Germany
Abstract :
With the increasing use of multimedia data in communication technologies, the idea of employing visual information in automatic speech recognition (ASR) has recently gathered momentum. In conjunction with the acoustic information, the visual data enhances recognition performance and improves the robustness of ASR systems in noisy and reverberant environments. In audio-visual systems, dynamic weighting of the audio and video streams according to their instantaneous confidence is essential for reliably and systematically achieving high performance. In this paper, we present a complete framework that allows blind estimation of dynamic stream weights for audio-visual speech recognition based on coupled hidden Markov models (CHMMs). As stream weight estimators, we consider multilayer perceptrons and logistic functions that map multidimensional reliability measure features to audio-visual stream weights. Training the parameters of the stream weight estimator requires numerous input-output tuples of reliability measure features and their corresponding stream weights. We estimate these target stream weights from oracle knowledge using an expectation maximization algorithm. We define 31-dimensional feature vectors that combine model-based and signal-based reliability measures as inputs to the stream weight estimator. During decoding, the trained stream weight estimator is used to blindly estimate stream weights. The entire framework is evaluated using the Grid audio-visual corpus and compared to state-of-the-art stream weight estimation strategies. The proposed framework significantly enhances the performance of the audio-visual ASR system in all examined test conditions.
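The logistic-function estimator mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact parameterization, so the following is an assumption rather than the authors' implementation: the function names, the coefficient vector `coeffs`, the `bias` term, and the convex combination of per-stream log-likelihoods (a common convention in multi-stream HMMs) are all hypothetical. Only the 31-dimensional input size is taken from the abstract.

```python
import math

FEATURE_DIM = 31  # feature dimensionality stated in the abstract


def logistic_stream_weight(features, coeffs, bias):
    """Map a reliability feature vector to an audio stream weight in [0, 1].

    Assumption: a plain logistic regression form, sigmoid(bias + coeffs . features).
    The parameters are hypothetical placeholders, not values from the paper.
    """
    if len(features) != len(coeffs):
        raise ValueError("features and coeffs must have the same length")
    z = bias + sum(w * x for w, x in zip(coeffs, features))
    # Numerically stable evaluation of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def combine_log_likelihoods(log_b_audio, log_b_video, audio_weight):
    """Weight per-stream log-likelihoods for one frame.

    Assumption: the video stream receives the complementary weight
    (1 - audio_weight), as is common in multi-stream HMM decoding.
    """
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_video


if __name__ == "__main__":
    # All-zero parameters give the neutral weight 0.5 for any input.
    frame_features = [0.0] * FEATURE_DIM
    weight = logistic_stream_weight(frame_features, [0.0] * FEATURE_DIM, 0.0)
    print(weight, combine_log_likelihoods(-10.0, -20.0, weight))
```

In a dynamic weighting scheme of this kind, the weight would be recomputed for every frame from that frame's reliability features, so that the less reliable stream contributes less to the combined score.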
Keywords :
audio streaming; audio-visual systems; decoding; expectation-maximisation algorithm; feature extraction; hidden Markov models; learning (artificial intelligence); multilayer perceptrons; speech coding; speech recognition; video streaming; 31-dimensional feature vector; ASR; CHMM; acoustical information; communication technology; coupled hidden Markov model; coupled-HMM-based audio-visual automatic speech recognition; expectation maximization algorithm; grid audio-visual corpus; learning dynamic stream weight blind estimation; logistic function; multidimensional reliability; multilayer perceptron; multimedia data; oracle knowledge; signal-based reliability; Heuristic algorithms; Reliability; Signal processing algorithms; Speech; Vectors; Weight measurement; Audio-visual speech recognition; logistic regression; reliability measure; stream weight
Journal_Title :
IEEE/ACM Transactions on Audio, Speech, and Language Processing
DOI :
10.1109/TASLP.2015.2409785