  • DocumentCode
    112359
  • Title
    Speech Recognition Using Long-Span Temporal Patterns in a Deep Network Model
  • Author
    Siniscalchi, Sabato Marco; Yu, Dong; Deng, Li; Lee, Chin-Hui
  • Author_Institution
    Faculty of Architecture and Engineering, University of Enna Kore, Enna, Italy
  • Volume
    20
  • Issue
    3
  • fYear
    2013
  • fDate
    Mar. 2013
  • Firstpage
    201
  • Lastpage
    204
  • Abstract
    In recent years, there has been renewed interest in the use of artificial neural networks (ANNs) for speech applications, and a new trend in advancing speech technology appears to have begun. Two main contributions have triggered this trend: 1) a major advance has been made in training the weights of deep neural networks (DNNs), and a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture has outperformed a conventional Gaussian mixture model hidden Markov model (GMM-HMM) automatic speech recognition (ASR) system on a challenging business search dataset; and 2) it has been shown that phoneme classification can be boosted by a hierarchical structure of multi-layer perceptrons (MLPs) trained to model long-span temporal patterns, with beneficial effects on language recognition tasks. In this work, we combine these two lines of research and demonstrate that word recognition accuracy can be significantly improved by arranging DNNs in a hierarchical structure to model long-term energy trajectories. The proposed solution has been evaluated on the 5000-word Wall Street Journal task, yielding consistent and significant improvements in both phone and word recognition accuracy. We have also analyzed the effects of various modeling choices on system performance and compared several architectural solutions.
  • Keywords
    hidden Markov models; multilayer perceptrons; speech recognition; ANN; DNN-HMM hybrid architecture; GMM-HMM automatic speech recognition system; Gaussian mixture model hidden Markov model ASR system; MLP; Wall Street Journal task; artificial neural networks; business search dataset; hierarchical structure; language recognition tasks; long-span temporal patterns; long-term energy trajectories; phone accuracy rate; phoneme classification; pre-trained deep neural network hidden Markov model; speech technology; word recognition accuracy; word recognition accuracy rate; Computational modeling; Data models; Hidden Markov models; Neural networks; Speech; Speech recognition; Training; Automatic speech recognition; deep neural networks; large vocabulary continuous speech recognition
  • fLanguage
    English
  • Journal_Title
    IEEE Signal Processing Letters
  • Publisher
    IEEE
  • ISSN
    1070-9908
  • Type
    jour
  • DOI
    10.1109/LSP.2013.2237901
  • Filename
    6403509