Title :
A distributed architecture for fast SGD sequence discriminative training of DNN acoustic models
Author_Institution :
IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
Abstract :
We describe a hybrid GPU/CPU architecture for stochastic gradient descent (SGD) training of neural network acoustic models under a lattice-based minimum Bayes risk (MBR) criterion. The crux of the method is to run SGD on a GPU that consumes frame-randomized minibatches produced by multiple workers on a cluster of multi-core CPU nodes, which compute the HMM state MBR occupancies. To minimize communication cost, a separate thread running on the GPU host receives minibatches from the workers, sends updated models back to them, and communicates with the SGD thread via a producer-consumer queue of minibatches. Using this architecture, sequence discriminative training can match the speed of GPU-based SGD cross-entropy (CE) training (1 hour of processing per 100 hours of audio on Switchboard). Additionally, we compare different ways of doing frame randomization and discuss experimental results on three LVCSR tasks (Switchboard 300 hours, English broadcast news 50 hours, and noisy Levantine telephone conversations 300 hours).
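The producer-consumer decoupling between the communication thread and the SGD thread can be illustrated with a minimal, self-contained Python sketch. This is not the paper's implementation: the worker communication and the GPU gradient step are simulated, and every name below (receive_minibatch_from_worker, sgd_thread, QUEUE_CAPACITY, and so on) is a hypothetical placeholder.

    # Sketch of the producer-consumer minibatch queue described in the abstract.
    # Worker communication and GPU updates are simulated; all names are hypothetical.
    import queue
    import random
    import threading

    QUEUE_CAPACITY = 16   # bounded queue gives back-pressure if SGD falls behind
    SENTINEL = None       # marks the end of the training data

    minibatch_queue = queue.Queue(maxsize=QUEUE_CAPACITY)

    def receive_minibatch_from_worker():
        """Stand-in for receiving a frame-randomized minibatch of features and
        MBR state occupancies from a CPU worker over the network."""
        return [random.random() for _ in range(256)]

    def communication_thread(num_minibatches):
        """GPU-host thread: pulls minibatches from the workers and enqueues them
        for the SGD thread; put() blocks when the queue is full."""
        for _ in range(num_minibatches):
            minibatch_queue.put(receive_minibatch_from_worker())
        minibatch_queue.put(SENTINEL)

    def sgd_thread():
        """Consumes minibatches from the queue and applies (simulated) SGD updates."""
        updates = 0
        while True:
            batch = minibatch_queue.get()
            if batch is SENTINEL:
                break
            _ = sum(batch)    # placeholder for the actual GPU gradient step
            updates += 1
        print(f"applied {updates} SGD updates")

    producer = threading.Thread(target=communication_thread, args=(1000,))
    consumer = threading.Thread(target=sgd_thread)
    producer.start(); consumer.start()
    producer.join(); consumer.join()

Because the queue is bounded, network I/O with the CPU workers overlaps with GPU computation while neither side can run arbitrarily far ahead of the other.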
Keywords :
Bayes methods; acoustic signal processing; entropy; gradient methods; graphics processing units; hidden Markov models; learning (artificial intelligence); multi-threading; multiprocessing systems; neural nets; CE training; CPU architecture; DNN acoustic models; English broadcast news; GPU architecture; GPU card; GPU host; GPU-based SGD cross-entropy training; HMM state MBR occupancies; LVCSR tasks; MBR criterion; SGD sequence discriminative training; SGD thread; Switchboard task; communication cost minimization; distributed architecture; frame randomization; frame-randomized minibatches; lattice-based minimum Bayes risk criterion; multicore CPU node cluster; neural network acoustic models; noisy Levantine telephone conversations; producer-consumer queue; stochastic gradient descent training; Instruction sets; Robustness; Training; sequence discriminative training; stochastic gradient descent
Conference_Title :
2014 IEEE Spoken Language Technology Workshop (SLT)
DOI :
10.1109/SLT.2014.7078571