Title :
On the importance of modeling and robustness for deep neural network feature
Author :
Chang, Shuo-Yiin; Wegmann, Steven
Author_Institution :
EECS Dept., Univ. of California-Berkeley, Berkeley, CA, USA
Abstract :
A large body of research has shown that acoustic features for speech recognition can be learned from data using neural networks with multiple hidden layers (DNNs) and that these learned features are superior to standard features (e.g., MFCCs). However, this superiority is usually demonstrated when the data used to learn the features is very similar in character to the data used to test recognition performance. An open question is how well these learned features generalize to realistic data that differs in character from their training data. The ability of a feature representation to generalize to unfamiliar data is a highly desirable form of robustness. In this paper we investigate the robustness of two DNN-based feature sets to training/test mismatch using the ICSI meeting corpus. The experiments were performed under three training/test scenarios: (1) matched near-field, (2) matched far-field, and (3) the mismatched condition of near-field training with far-field testing. The experiments leverage simulation and a novel sampling process that we have developed for diagnostic analysis within the HMM-based speech recognition framework. First, diagnostic analysis shows that a DNN-based feature representation that uses MFCC inputs (MFCC-DNN) is indeed superior to the corresponding MFCC baselines in the two matched scenarios, where the recognition errors stem from an incorrect model, but that the DNN-based features and MFCCs have nearly identical, poor performance in the mismatched scenario. Second, we show that a DNN-based feature representation that uses a more robust input, namely power-normalized spectrum (PNS) and Gabor filters, performs nearly as well as the MFCC-DNN features in the matched scenarios and much better than MFCCs and MFCC-DNNs in the mismatched scenario.
Keywords :
feature extraction; neural nets; signal representation; speech recognition; DNN-based feature representation; DNN-based feature sets; Gabor filters; HMM-based speech recognition framework; ICSI meeting corpus; MFCC inputs; MFCC-DNN; PNS; acoustic features; diagnostic analysis; far-field testing; learned features; matched far-field; matched near-field; mismatched condition near-field training; neural networks; power normalized spectrum; realistic data; recognition errors; recognition performance; training data; Data models; Hidden Markov models; Mel frequency cepstral coefficient; Neural networks; Robustness; Speech recognition; Training; acoustic feature; deep neural network; robust speech recognition;
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
Conference_Location :
South Brisbane, QLD
DOI :
10.1109/ICASSP.2015.7178828