• DocumentCode
    178606
  • Title
    Refinements of regression-based context-dependent modelling of deep neural networks for automatic speech recognition
  • Author
    Guangsen Wang; Khe Chai Sim

  • Author_Institution
    School of Computing, National University of Singapore, Singapore
  • fYear
    2014
  • fDate
    4-9 May 2014
  • Firstpage
    3022
  • Lastpage
    3026
  • Abstract
    The data sparsity problem of context-dependent (CD) acoustic modelling with deep neural networks (DNNs) in speech recognition is addressed by using the decision tree state clusters as the training targets. As a result, the CD states within a cluster cannot be distinguished during decoding. This problem, referred to as the clustering problem, is not explicitly addressed in the current literature. In our previous work, a regression-based CD-DNN framework was proposed to address both the data sparsity and the clustering problems. This paper investigates several refinements of the regression-based CD-DNN, including two more representative state approximation schemes and the incorporation of sequential learning. The two approximations are derived from statistics learned from the training data. Sequential learning is applied to both the broad phone DNN detectors and the regression NN. The proposed refinements are evaluated on a broadcast news transcription task. For the cross-entropy systems, the two approximations perform consistently better than our previous work. A consistent performance gain over the corresponding cross-entropy trained systems is also observed for both the baseline CD-DNN and the regression model with sequential learning. (An illustrative code sketch of the regression mapping follows this record.)
  • Keywords
    decision trees; learning (artificial intelligence); neural nets; regression analysis; speech recognition; CD acoustic modelling; baseline CD-DNN; broad phone DNN detectors; broadcast news transcription task; clustering problem; context-dependent acoustic modelling; cross-entropy trained systems; data sparsity problem; decision tree state clusters; deep neural networks; regression-based CD-DNN framework; representative state approximation schemes; sequential learning; speech recognition; training targets; Approximation methods; Detectors; Hidden Markov models; Mathematical model; Neural networks; Speech recognition; Training; Articulatory Features; Canonical State Modelling; Context Dependent Modelling; Deep Neural Network; Logistic Regression; Sequential Learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • Conference_Location
    Florence
  • Type
    conf
  • DOI
    10.1109/ICASSP.2014.6854155
  • Filename
    6854155
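
The abstract above outlines a regression-based CD-DNN in which broad phone DNN detectors feed a regression network that scores context-dependent states, so that states tied to the same decision-tree cluster can still be distinguished during decoding. The toy sketch below illustrates only that general mapping; it is not the authors' implementation, and the class counts, the log-posterior features, and the softmax regression layer are assumptions made for illustration.

```python
# Toy sketch of a regression layer that maps broad-phone detector
# posteriors to context-dependent (CD) state scores. Not the authors'
# implementation: class counts, log-posterior features and the softmax
# regression layer are illustrative assumptions.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

n_frames = 5        # acoustic frames in a toy utterance
n_broad = 12        # broad-phone / detector classes (assumed)
n_cd_states = 30    # context-dependent states (assumed)

# Posteriors such as broad-phone DNN detectors might emit per frame.
detector_post = softmax(rng.normal(size=(n_frames, n_broad)))

# Regression parameters mapping detector evidence to CD states.
W = rng.normal(scale=0.1, size=(n_broad, n_cd_states))
b = np.zeros(n_cd_states)

# Use log posteriors as regression features (an assumed choice).
features = np.log(detector_post + 1e-10)

# CD state posteriors for decoding: states tied to the same decision-tree
# cluster receive distinct scores because their regression weights differ.
cd_post = softmax(features @ W + b)
print(cd_post.shape)  # (5, 30)
```

In the paper's framework the regression parameters would be trained, and per the refinements studied, sequence-trained, rather than randomly initialised as in this sketch.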