• DocumentCode
    37724
  • Title
    Optimization Techniques to Improve Training Speed of Deep Neural Networks for Large Speech Tasks
  • Author
    Sainath, Tara N.; Kingsbury, Brian; Soltau, Hagen; Ramabhadran, Bhuvana
  • Author_Institution
    IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
  • Volume
    21
  • Issue
    11
  • fYear
    2013
  • fDate
    Nov. 2013
  • Firstpage
    2267
  • Lastpage
    2276
  • Abstract
    While Deep Neural Networks (DNNs) have achieved tremendous success for large vocabulary continuous speech recognition (LVCSR) tasks, training these networks is slow. Even today, the most common approach to training DNNs is stochastic gradient descent, run serially on one machine. Serial training, coupled with the large number of training parameters (10-50 million) and large speech data set sizes (20-100 million training points), makes DNN training very slow for LVCSR tasks. In this work, we explore a variety of optimization techniques to improve DNN training speed. These include parallelization of the gradient computation during cross-entropy and sequence training, as well as reducing the number of parameters in the network using a low-rank matrix factorization. Applying the proposed optimization techniques, we show that DNN training can be sped up by a factor of 3 on a 50-hour English Broadcast News (BN) task with no loss in accuracy. Furthermore, using the proposed techniques, we are able to train DNNs on a 300-hour Switchboard (SWB) task and a 400-hour English BN task, showing relative improvements of 9-30% over a state-of-the-art GMM/HMM system, while the DNN has fewer parameters than the GMM/HMM system.
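    A minimal sketch (not the authors' implementation) of the low-rank factorization idea mentioned in the abstract: the weight matrix of the DNN's final softmax layer is replaced by the product of two smaller matrices, cutting the parameter count. The layer sizes and rank below are illustrative assumptions, not values taken from the paper, and PyTorch is used only for convenience.

        import torch.nn as nn

        # Assumed, illustrative dimensions: hidden layer size, number of
        # context-dependent output targets, and the chosen low rank.
        hidden_dim, num_targets, rank = 1024, 9300, 128

        # Standard full-rank softmax layer: hidden_dim * num_targets weights.
        full_layer = nn.Linear(hidden_dim, num_targets)

        # Low-rank factorization: W is replaced by A (hidden_dim x rank) times
        # B (rank x num_targets), i.e. a linear bottleneck with no nonlinearity.
        low_rank_layer = nn.Sequential(
            nn.Linear(hidden_dim, rank, bias=False),
            nn.Linear(rank, num_targets),
        )

        def num_params(module):
            return sum(p.numel() for p in module.parameters())

        print("full-rank params:", num_params(full_layer))     # roughly 9.5M
        print("low-rank params:", num_params(low_rank_layer))  # roughly 1.3M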
  • Keywords
    entropy; hidden Markov models; matrix decomposition; neural nets; optimisation; speech recognition; DNN training; English broadcast news; GMM-HMM system; LVCSR tasks; cross-entropy; deep neural networks; gradient computation; large speech tasks; large vocabulary continuous speech recognition; low-rank matrix factorization; optimization techniques; parallelization; serial training; speech data set sizes; stochastic gradient descent; switchboard task; time 300 hr; time 50 hour; training speed; Hidden Markov models; Large scale systems; Linear programming; Neural networks; Optimization; Pattern recognition; Speech recognition; deep neural networks; parallel optimization techniques
  • fLanguage
    English
  • Journal_Title
    IEEE Transactions on Audio, Speech, and Language Processing
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type
    jour
  • DOI
    10.1109/TASL.2013.2284378
  • Filename
    6619439