• DocumentCode
    3162875
  • Title
    Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
  • Author
    Abdel-Hamid, Ossama ; Mohamed, Abdel-rahman ; Jiang, Hui ; Penn, Gerald
  • Author_Institution
    Dept. of Comput. Sci. & Eng., York Univ., Toronto, ON, Canada
  • fYear
    2012
  • fDate
    25-30 March 2012
  • Firstpage
    4277
  • Lastpage
    4280
  • Abstract
    Convolutional Neural Networks (CNNs) have shown success in achieving translation invariance for many image processing tasks. This success is largely attributed to the use of local filtering and max-pooling in the CNN architecture. In this paper, we propose to apply CNNs to speech recognition within the framework of the hybrid NN-HMM model. We propose to use local filtering and max-pooling in the frequency domain to normalize speaker variance and thereby achieve higher multi-speaker speech recognition performance. In our method, a local filtering layer and a max-pooling layer are added as a pair at the lowest end of the neural network (NN) to normalize spectral variations of speech signals. In our experiments, the proposed CNN architecture is evaluated on a speaker-independent speech recognition task using the standard TIMIT data sets. Experimental results show that the proposed CNN method achieves over 10% relative error reduction on the core TIMIT test sets compared with a regular NN using the same number of hidden layers and weights. Our results also show that the best result of the proposed CNN model is better than previously published results on the same TIMIT test sets that use a pre-trained deep NN model.
  • Keywords
    convolution; filtering theory; hidden Markov models; neural nets; speech recognition; convolutional neural network; frequency domain; hybrid neural network-hidden Markov model; local filtering; max-pooling; speech recognition; standard TIMIT data set; Acoustics; Artificial neural networks; Convolution; Hidden Markov models; Speech; Speech recognition; Training; acoustic modeling; local filtering; max-pooling; neural networks; speech recognition;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on
  • Conference_Location
    Kyoto
  • ISSN
    1520-6149
  • Print_ISBN
    978-1-4673-0045-2
  • Electronic_ISBN
    1520-6149
  • Type
    conf
  • DOI
    10.1109/ICASSP.2012.6288864
  • Filename
    6288864
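
The abstract describes local filtering followed by max-pooling along the frequency axis as a way to absorb spectral shifts between speakers. The toy sketch below illustrates that idea only; it is not the authors' architecture. The 8-band frame, the filter weights, the pool size, and the function names `local_filter` and `max_pool` are all hypothetical, and no nonlinearity or HMM component is modeled.

```python
def local_filter(bands, weights, bias=0.0):
    """Slide one shared filter along the frequency axis (valid positions only)."""
    width = len(weights)
    return [
        sum(w * x for w, x in zip(weights, bands[i:i + width])) + bias
        for i in range(len(bands) - width + 1)
    ]


def max_pool(activations, pool_size):
    """Take the max over non-overlapping groups of adjacent frequency positions."""
    return [
        max(activations[i:i + pool_size])
        for i in range(0, len(activations) - pool_size + 1, pool_size)
    ]


# Hypothetical 8-band spectral frame with a single peak, and the same frame
# with the peak moved up one band (a crude stand-in for speaker variation).
frame = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# Assumed peak-detecting filter; a trained model would learn its own weights.
weights = [-0.5, 1.0, -0.5]

filtered = local_filter(frame, weights)
filtered_shifted = local_filter(shifted, weights)

pooled = max_pool(filtered, pool_size=3)
pooled_shifted = max_pool(filtered_shifted, pool_size=3)

print(filtered, filtered_shifted)  # the filter outputs differ between frames
print(pooled, pooled_shifted)      # the pooled outputs match in this example
```

In this example the one-band shift changes the filter activations but not the pooled values, because both peak positions fall inside the same pooling group. A shift that crosses a pooling boundary would still change the output, so the invariance is local rather than absolute.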