Regional Information Center for Science and Technology (RICEST) - Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition

DocumentCode :

3162295

Title :

Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition

Author :

Yu, Dong ; Siniscalchi, Sabato Marco ; Deng, Li ; Lee, Chin-Hui

Author_Institution :

Speech Research Group, Microsoft Research, Redmond, WA, USA

fYear :

2012

fDate :

25-30 March 2012

Firstpage :

4169

Lastpage :

4172

Abstract :

Generating high-precision lattices of sub-phonetic attributes (also known as phonological features) and of phones is a key front-end component of detection-based, bottom-up speech recognition. In this paper we employ deep neural networks (DNNs) to improve detection accuracy over conventional shallow multi-layer perceptrons (MLPs) with one hidden layer. We explore a range of DNN architectures with five to seven hidden layers and up to 2048 hidden units per layer. Trained on the SI84 and tested on the Nov92 WSJ data, the proposed DNNs achieve significant improvements over the shallow MLPs, producing frame-level attribute estimation accuracies above 90% for all 21 attributes tested in the full system. On the phone detection task, we also obtain a frame-level accuracy of 86.6%. This level of high-precision detection of basic speech units opens the door to a new family of flexible speech recognition system designs that support both top-down and bottom-up lattice-based search strategies and knowledge integration.
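
The frame-level accuracy figures quoted in the abstract can be illustrated with a minimal sketch. It assumes that each attribute detector emits one posterior per frame and that a frame counts as correct when the thresholded decision matches the reference label. The function names, the 0.5 threshold, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def detect_attribute(posteriors: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn per-frame posteriors for one attribute into 0/1 decisions.

    The 0.5 threshold is an illustrative assumption, not a value from the paper.
    """
    return [1 if p >= threshold else 0 for p in posteriors]


def frame_accuracy(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of frames whose predicted label matches the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must cover the same frames")
    if len(reference) == 0:
        raise ValueError("at least one frame is required")
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)


# Hypothetical toy data: one attribute detector scored over five frames.
posteriors = [0.9, 0.8, 0.3, 0.6, 0.1]
reference = [1, 1, 0, 0, 0]

decisions = detect_attribute(posteriors)
accuracy = frame_accuracy(decisions, reference)
print(f"frame-level accuracy: {accuracy:.2f}")
```

The same `frame_accuracy` helper would apply to the phone detection task if the labels were per-frame phone identities instead of binary attribute decisions.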

Keywords :

estimation theory; neural nets; speech processing; speech recognition; Nov92 WSJ data; attribute estimation accuracy; bottom-up search strategy; deep neural networks; detection-based bottom-up speech recognition; frame-level accuracy; frame-level attribute estimation; hidden layers; high-precision sub-phonetic attribute; knowledge integration; lattice-based search strategy; phone detection; phone estimation accuracy; phone lattices; phonological features; top-down search strategy; Accuracy; Detectors; Feature extraction; Neural networks; Speech; Speech recognition; Training; attribute detection; automatic speech attribute transcription; detection-based ASR; phone recognition

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on

Conference_Location :

Kyoto

ISSN :

1520-6149

Print_ISBN :

978-1-4673-0045-2

Electronic_ISBN :

1520-6149

Type :

conf

DOI :

10.1109/ICASSP.2012.6288837

Filename :

6288837

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3162295