Regional Information Center for Science and Technology (RICEST) - Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: Why DNN surpasses GMMs in acoustic modeling

DocumentCode :

3123684

Title :

Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: Why DNN surpasses GMMs in acoustic modeling

Author :

Jia Pan ; Cong Liu ; Zhiguo Wang ; Yu Hu ; Hui Jiang

Author_Institution :

iFlytek Res., Hefei, China

fYear :

2012

fDate :

5-8 Dec. 2012

Firstpage :

301

Lastpage :

305

Abstract :

Recently, context-dependent deep neural networks (DNNs) have been reported to achieve unprecedented gains in many challenging ASR tasks, including the well-known Switchboard task. In this paper, we first investigate DNNs for several large vocabulary speech recognition tasks. Our results confirm that DNNs consistently achieve about 25-30% relative error reduction over the best discriminatively trained GMMs, even in ASR tasks with up to 700 hours of training data. Next, we conduct a series of experiments to study where this gain comes from. Our experiments show that the gain is almost entirely attributable to the DNN's input feature vectors, which are concatenated from several consecutive speech frames within a relatively long context window. Finally, we propose a few ways to reconfigure the DNN input features, such as using logarithm spectrum features or VTLN-normalized features. Our results show that each of these methods yields over 3% relative error reduction over traditional MFCC or PLP features in DNNs.
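
The context-window input described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. The function name `splice_frames`, the window sizes, and the edge handling (repeating the first or last frame) are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence


def splice_frames(
    frames: Sequence[Sequence[float]], left: int, right: int
) -> List[List[float]]:
    """Concatenate each frame with its `left` preceding and `right` following frames.

    Edge frames are padded by repeating the first or last frame
    (an assumed convention, not taken from the paper).
    """
    n = len(frames)
    spliced: List[List[float]] = []
    for t in range(n):
        window: List[float] = []
        for offset in range(-left, right + 1):
            # Clamp the index so windows at the edges reuse the boundary frame.
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Toy example: three 2-dimensional "feature" frames, one frame of context per side.
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
spliced = splice_frames(frames, left=1, right=1)
```

With one frame of context on each side, every 2-dimensional frame becomes a 6-dimensional input vector. Real systems would typically splice MFCC, PLP, or log-spectrum frames in the same way, with a wider window.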

Keywords :

Gaussian processes; acoustic signal processing; learning (artificial intelligence); neural nets; speech recognition; vectors; vocabulary; ASR tasks; DNN input feature vectors; GMMS; Gaussian mixture models; VTLN normalized features; acoustic modeling; automatic speech recognition; context-dependent deep neural network; discriminatively trained GMM; large vocabulary continuous speech recognition; logarithm spectrum features; relative error reduction; speech frames; switchboard task; training data; Context; Hidden Markov models; Neural networks; Speech recognition; Switches; Training; Vectors; acoustic modeling; deep neural networks; pre-training; speech recognition;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Chinese Spoken Language Processing (ISCSLP), 2012 8th International Symposium on

Conference_Location :

Kowloon

Print_ISBN :

978-1-4673-2506-6

Electronic_ISBN :

978-1-4673-2505-9

Type :

conf

DOI :

10.1109/ISCSLP.2012.6423452

Filename :

6423452

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3123684