Title :
Modeling long temporal contexts in convolutional neural network-based phone recognition
Author_Institution :
MTA-SZTE Res. Group on Artificial Intell., Univ. of Szeged, Szeged, Hungary
Abstract :
The deep neural network component of current hybrid speech recognizers is trained on a context of consecutive feature vectors. Here, we investigate whether the time span of this input can be extended by splitting it up and modeling it in smaller chunks. One method for this is to train a hierarchy of two networks, while the less well-known split temporal context (STC) method models the left and right contexts of a frame separately. We evaluate these techniques within a convolutional neural network framework, and find that the two approaches can be readily combined. With the combined model we can expand the time span of our network to 69 frames, and we achieve a 7.5% relative error rate reduction compared to modeling this large context as a single block. We report a phone error rate of 17.1% on the TIMIT core test set, which is among the best published results.
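The combined architecture described above can be illustrated with a minimal NumPy sketch: two lower-stage networks each model one half of the 69-frame input (split temporal context), and an upper-stage merger network combines their outputs into phone posteriors (the two-network hierarchy). All layer sizes, the single hidden layer per stage, and the 35-frame split with a shared centre frame are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's exact setup).
N_FRAMES = 69   # total time span covered, as in the paper
N_HALF = 35     # frames per left/right block; centre frame is shared
N_FEATS = 40    # e.g. mel filter bank coefficients per frame
N_HIDDEN = 128
N_PHONES = 39   # TIMIT's reduced phone set

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Lower stage: separate networks for the left and right contexts (STC).
W_left = rng.normal(0, 0.01, (N_HALF * N_FEATS, N_HIDDEN))
W_right = rng.normal(0, 0.01, (N_HALF * N_FEATS, N_HIDDEN))

# Upper stage: merger network combining the two context embeddings
# into phone posteriors (the second level of the hierarchy).
W_merge = rng.normal(0, 0.01, (2 * N_HIDDEN, N_PHONES))

def stc_forward(frames):
    """frames: (N_FRAMES, N_FEATS) block centred on the frame to classify."""
    left = frames[:N_HALF].ravel()               # frames 0..34
    right = frames[N_FRAMES - N_HALF:].ravel()   # frames 34..68
    h = np.concatenate([relu(left @ W_left), relu(right @ W_right)])
    return softmax(h @ W_merge)

posteriors = stc_forward(rng.normal(0, 1, (N_FRAMES, N_FEATS)))
print(posteriors.shape)
```

The key point is that each lower-stage network sees only half of the long context, so the full 69-frame span is never modeled as one monolithic input block. In the paper the lower stages are convolutional (and use maxout units), which this dense sketch omits for brevity.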
Keywords :
convolution; learning (artificial intelligence); neural nets; speech recognition; vectors; STC method; TIMIT core test set; consecutive feature vectors; convolutional neural network-based phone recognition; current hybrid speech recognizers; deep neural network component; long temporal context modeling; phone error rate; relative error rate reduction; split temporal context method; Context; Context modeling; Convolution; Error analysis; Hidden Markov models; Neural networks; Speech recognition; Deep neural network; TIMIT; convolutional neural network; maxout; split temporal context;
Conference_Titel :
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference_Location :
South Brisbane, QLD, Australia
DOI :
10.1109/ICASSP.2015.7178837