DocumentCode :
15707
Title :
Regression-Based Context-Dependent Modeling of Deep Neural Networks for Speech Recognition
Author :
Guangsen Wang ; Khe Chai Sim
Author_Institution :
Dept. of Comput. Sci., Nat. Univ. of Singapore, Singapore, Singapore
Volume :
22
Issue :
11
fYear :
2014
fDate :
Nov. 2014
Firstpage :
1660
Lastpage :
1669
Abstract :
The data sparsity problem is addressed by using the decision tree state clusters as the training targets for the state-of-the-art context-dependent (CD) deep neural network (DNN) systems. The CD states within a cluster cannot be distinguished at the frame level. We surmise that the state clustering may cause an issue for the standard CD-DNNs, which has so far not been addressed in the literature. In this paper, a logistic regression framework is proposed for the CD-DNNs based on a set of broad phone classes to address both the data sparsity and the clustering problems. To address the data sparsity issue, the triphones are clustered into shorter biphones with broad phone contexts under multiple articulatory categories. A DNN is trained to discriminate the disjoint biphone clusters within each articulatory category. The regression bases are formed by the concatenated log posterior probabilities of all the broad phone DNNs. Logistic regression is used to transform the regression bases into the triphone state posteriors. Clustering of the regression parameters is used to reduce the regression model complexity while still achieving unique acoustic scores for all possible triphones. Based on some approximations, the regression model can be trained as a sparse softmax layer and its parameters can be learned by optimizing the cross-entropy criterion. The experimental results on a broadcast news transcription task reveal that the proposed regression-based CD-DNN significantly outperforms the standard CD-DNN. The best system provides a 1.3% absolute word error rate reduction compared to the best standard CD-DNN system.
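The regression step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the toy dimensions, the random stand-ins for the broad phone DNN outputs, and the binary connectivity mask used to make the softmax layer sparse are all assumptions introduced here for illustration only.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def regression_bases(broad_phone_logits):
    # Concatenate the log posteriors of every broad phone DNN
    # (one array of output logits per articulatory category).
    return np.concatenate([log_softmax(z) for z in broad_phone_logits], axis=-1)

def triphone_state_posteriors(bases, weights, bias, mask):
    # Sparse softmax layer: masked weights map the regression bases
    # to posteriors over triphone states.
    logits = bases @ (weights * mask).T + bias
    return np.exp(log_softmax(logits))

def cross_entropy(posteriors, targets):
    # Mean negative log posterior of the target state per frame.
    picked = posteriors[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked + 1e-12)))

# Toy setup (all sizes hypothetical): 5 frames, 2 articulatory
# categories with 3 and 4 biphone clusters, 6 triphone states.
rng = np.random.default_rng(0)
num_frames, num_states = 5, 6
cluster_counts = [3, 4]

# Random logits stand in for the outputs of trained broad phone DNNs.
broad_phone_logits = [rng.normal(size=(num_frames, k)) for k in cluster_counts]
bases = regression_bases(broad_phone_logits)

weights = rng.normal(size=(num_states, bases.shape[1]))
bias = np.zeros(num_states)
# Assumed sparsity pattern: a random binary mask, not the paper's pattern.
mask = (rng.random(weights.shape) < 0.5).astype(float)

posteriors = triphone_state_posteriors(bases, weights, bias, mask)
targets = rng.integers(0, num_states, size=num_frames)
loss = cross_entropy(posteriors, targets)
```

In a trainable version, the gradient of `loss` with respect to `weights` would be multiplied by `mask` as well, so that pruned connections stay at zero during cross-entropy optimization.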
Keywords :
approximation theory; decision trees; neural nets; regression analysis; speech recognition; CD-DNN; Logistic regression; approximations; clustering problems; concatenated log posterior probabilities; context-dependent deep neural network systems; cross-entropy criterion; data sparsity; decision tree state clusters; deep neural networks; multiple articulatory categories; regression-based context-dependent modeling; sparse softmax layer; speech recognition; Approximation methods; Context; Context modeling; Detectors; Equations; Mathematical model; Training; Articulatory features; context dependent modeling; deep neural network; logistic regression;
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE/ACM Transactions on
Publisher :
IEEE
ISSN :
2329-9290
Type :
jour
DOI :
10.1109/TASLP.2014.2344855
Filename :
6872780