Title :
Speech recognition with auxiliary information
Author :
Stephenson, Todd A. ; Doss, Mathew Magimai ; Bourlard, Hervé
Author_Institution :
Dalle Molle Inst. for Perceptual Artificial Intelligence, Martigny, Switzerland
fDate :
5/1/2004
Abstract :
State-of-the-art automatic speech recognition (ASR) systems are usually based on hidden Markov models (HMMs) that emit cepstral-based features, which are assumed to be piecewise stationary. These features are not particularly robust to noise and are also known to be very sensitive to "auxiliary" information such as pitch, energy, and rate-of-speech (ROS). Attempts so far to include such auxiliary information in state-of-the-art ASR systems have often simply appended the auxiliary features to the standard acoustic feature vectors. In the present paper, we investigate different approaches to incorporating this auxiliary information using dynamic Bayesian networks (DBNs) or hybrid HMM/ANNs (HMMs with artificial neural networks). These approaches are motivated by the fact that the auxiliary information is not necessarily (directly) emitted by the HMM states but, rather, carries higher-level information (e.g., speaker characteristics) that is correlated with the standard features. As is implicitly done for gender modeling elsewhere, this auxiliary information then appears as a conditional variable in the emission distributions and can be hidden (except in the case of some HMM/ANNs) when its estimates become too noisy. Based on recognition experiments carried out on the OGI Numbers database (free-format numbers spoken over the telephone), we show that auxiliary information that conditions the distribution of the standard features can, in certain conditions, provide more robust recognition than auxiliary information that is appended to the standard features; this is most evident in the case of energy as an auxiliary variable in noisy speech.
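The distinction drawn in the abstract between appending auxiliary information and conditioning on it can be illustrated with a minimal Python sketch. It is not the authors' implementation: the diagonal-covariance Gaussians, the linear shift of the mean by the auxiliary value, the discrete grid used when the auxiliary variable is hidden, and all function and parameter names are assumptions made for illustration.

```python
import numpy as np


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    x, mean, var = (np.asarray(v, dtype=float) for v in (x, mean, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def loglik_appended(x, a, mean_xa, var_xa):
    """Baseline: the state emits the stacked vector [x, a]."""
    return log_gauss_diag(np.append(x, a), mean_xa, var_xa)


def loglik_conditioned(x, a, mean_x, weights, var_x):
    """Observed auxiliary value a conditions the emission p(x | state, a).

    Assumption: a shifts the state's mean linearly (mean_x + weights * a).
    """
    shifted_mean = np.asarray(mean_x, dtype=float) + np.asarray(weights, dtype=float) * a
    return log_gauss_diag(x, shifted_mean, var_x)


def loglik_hidden(x, a_values, a_probs, mean_x, weights, var_x):
    """Hidden auxiliary variable: log sum_a P(a) p(x | state, a).

    Assumption: a takes values on a small discrete grid with known probabilities.
    """
    terms = [
        np.log(p) + loglik_conditioned(x, a, mean_x, weights, var_x)
        for a, p in zip(a_values, a_probs)
    ]
    return np.logaddexp.reduce(terms)
```

In this sketch, setting `weights` to zero makes the conditioned and hidden variants reduce to an ordinary Gaussian emission, which is one way to see that conditioning only changes the model where the standard features actually depend on the auxiliary variable.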
Keywords :
Gaussian processes; belief networks; cepstral analysis; hidden Markov models; neural nets; speech processing; speech recognition; Gaussian mixture models; OGI Numbers database; artificial neural networks; automatic speech recognition system; auxiliary information; cepstral-based features; dynamic Bayesian networks; emission distributions; noisy speech; piecewise stationary; pitch; rate-of-speech; Acoustic emission; Acoustic noise; Bayesian methods; Noise robustness; Spatial databases; Telephony
Journal_Title :
Speech and Audio Processing, IEEE Transactions on
DOI :
10.1109/TSA.2003.822631