Title :
Cepstrum-Domain Model Combination Based on Decomposition of Speech and Noise Using MMSE-LSA for ASR in Noisy Environments
Author :
Kim, Hong Kook ; Rose, Richard C.
Author_Institution :
Dept. of Inf. & Commun., Gwangju Inst. of Sci. & Technol., Gwangju
fDate :
5/1/2009 12:00:00 AM
Abstract :
This paper presents an efficient method for combining models of speech and noise for robust speech recognition in noisy environments. The method decomposes the cepstrum-domain representation of noise-corrupted speech into clean speech cepstrum and background noise cepstrum components using a minimum mean squared error-log spectral amplitude (MMSE-LSA) criterion. Speech recognition is then performed on noisy cepstrum-domain observations using a model formed by the parallel combination of cepstrum-domain clean speech distributions and background noise distributions estimated with this MMSE-LSA-based noise decomposition. The method is far more efficient than other parallel model combination (PMC) procedures because model combination is performed directly in the cepstrum domain rather than in the linear spectral domain. Whereas existing PMC procedures treat background noise model estimation as a separate issue, this method explicitly incorporates a mechanism to continually update background noise models and signal-to-noise ratio (SNR) estimates over time. The proposed cepstrum-domain model combination method is compared with a well-known implementation of PMC, which uses a log-normal approximation when combining speech and background noise model means and variances, on a connected digit string recognition task subjected to mismatched channel and environment conditions. The results show that the proposed model combination technique gives a word error rate comparable to that of PMC when background noise information and SNR are known prior to estimation. The paper also presents results of experiments in which a combination of cepstrum-domain feature compensation and model combination is applied to this task.
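For orientation, the baseline mentioned in the abstract, PMC with a log-normal approximation, can be sketched as follows. This is a minimal illustration of the standard log-normal mapping, not the paper's proposed cepstrum-domain method. It assumes diagonal covariances and parameters that are already in the log-spectral domain, so the cepstrum/log-spectrum DCT step is omitted. The function names and the `gain` parameter are hypothetical.

```python
import math

def log_to_linear(mu_log, var_log):
    """Map diagonal Gaussian parameters from the log-spectral domain
    to the linear-spectral domain (log-normal moment formulas)."""
    mu_lin = [math.exp(m + 0.5 * v) for m, v in zip(mu_log, var_log)]
    var_lin = [m * m * (math.exp(v) - 1.0) for m, v in zip(mu_lin, var_log)]
    return mu_lin, var_lin

def linear_to_log(mu_lin, var_lin):
    """Inverse mapping: approximate the linear-domain distribution
    as log-normal and recover log-domain mean and variance."""
    var_log = [math.log(v / (m * m) + 1.0) for m, v in zip(mu_lin, var_lin)]
    mu_log = [math.log(m) - 0.5 * vl for m, vl in zip(mu_lin, var_log)]
    return mu_log, var_log

def pmc_lognormal(speech, noise, gain=1.0):
    """Combine (mean, variance) pairs for speech and noise.
    `gain` is an assumed scalar level-matching term."""
    mu_s, var_s = log_to_linear(*speech)
    mu_n, var_n = log_to_linear(*noise)
    # Speech and noise are assumed additive and independent in the linear domain.
    mu_y = [gain * s + n for s, n in zip(mu_s, mu_n)]
    var_y = [gain * gain * s + n for s, n in zip(var_s, var_n)]
    return linear_to_log(mu_y, var_y)

if __name__ == "__main__":
    speech = ([2.0, 1.0], [0.3, 0.2])   # illustrative values only
    noise = ([0.5, 1.5], [0.1, 0.1])
    print(pmc_lognormal(speech, noise))
```

The round trip through the linear spectral domain shown here is the step the abstract says the proposed method avoids by combining models directly in the cepstrum domain.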
Keywords :
cepstral analysis; speech recognition; background noise cepstrum component; background noise distribution; background noise model estimation; cepstrum domain clean speech distribution; cepstrum domain feature compensation; cepstrum domain model combination; cepstrum domain representation; cepstrum-domain model combination; digit string recognition; log-normal approximation; minimum mean squared error-log spectral amplitude criterion; noise decomposition; noise-corrupted speech; noisy cepstrum domain; noisy environment; parallel model combination; robust speech recognition; signal-to-noise ratio estimates; speech cepstrum; speech decomposition; word error rate; Automatic speech recognition; Background noise; Cepstrum; Error analysis; Noise level; Noise robustness; Signal to noise ratio; Speech enhancement; Speech recognition; Working environment noise; Acoustic model combination; cepstrum decomposition; feature compensation; minimum mean squared error–log spectral amplitude (MMSE-LSA); robust speech recognition;
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
DOI :
10.1109/TASL.2008.2012319