Improvements to speaker adaptive training of deep neural networks

Author

Yajie Miao ; Lu Jiang ; Hao Zhang ; Florian Metze

Author_Institution

Sch. of Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA

Year

2014

Firstpage

165

Lastpage

170

Abstract

Speaker adaptive training (SAT) is a well-studied technique for Gaussian mixture acoustic models (GMMs). Recently, we proposed to perform SAT for deep neural networks (DNNs), with speaker i-vectors applied in feature learning. The resulting SAT-DNN models significantly outperform DNNs on word error rates (WERs). In this paper, we present different methods to further improve and extend SAT-DNN. First, we conduct a detailed analysis to investigate i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from video signals. On a collection of instructional videos, incorporation of the additional visual features is observed to boost the recognition accuracy of SAT-DNN.
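
As a rough illustration of the feature-fusion idea mentioned in the abstract, the sketch below appends one per-speaker vector (an i-vector, optionally extended with global attributes such as age or gender scores) to every acoustic frame of an utterance. This is plain concatenation chosen for simplicity, not necessarily the fusion scheme used in the paper; the function name `fuse_speaker_features` and all dimensions (40-dimensional frames, 100-dimensional i-vector, 2 attributes) are hypothetical.

```python
import numpy as np

def fuse_speaker_features(frames, ivector, attributes=None):
    """Append a per-speaker vector to every frame (simple concatenation).

    frames:     (num_frames, feat_dim) acoustic features for one utterance
    ivector:    (ivec_dim,) speaker i-vector, assumed to be precomputed
    attributes: optional (attr_dim,) global speaker attributes (hypothetical)
    """
    speaker_vec = np.asarray(ivector, dtype=np.float64)
    if attributes is not None:
        # Enrich the i-vector with the extra speaker-level attributes.
        speaker_vec = np.concatenate(
            [speaker_vec, np.asarray(attributes, dtype=np.float64)]
        )
    # Repeat the speaker vector once per frame, then join along the feature axis.
    tiled = np.tile(speaker_vec, (frames.shape[0], 1))
    return np.hstack([frames, tiled])

# Toy data with hypothetical sizes: 5 frames of 40-dim features.
frames = np.zeros((5, 40))
ivector = np.ones(100)
attributes = np.array([0.3, 1.0])  # e.g. normalized age, gender flag

fused = fuse_speaker_features(frames, ivector, attributes)
print(fused.shape)
```

Because the speaker vector is constant within an utterance, every fused frame carries the same speaker-level information alongside its frame-level acoustic features.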

Keywords

Gaussian processes; acoustic signal processing; feature extraction; learning (artificial intelligence); mixture models; multimedia systems; natural language processing; neural nets; sensor fusion; speaker recognition; video signal processing; BNF generation; CNN acoustic modeling; GMM; Gaussian mixture acoustic models; SAT-DNN model; WER; bottleneck feature generation; convolutional neural network acoustic modeling; deep neural networks; feature learning; flexible feature fusion; global speaker attributes; i-vector extractor training; instructional videos; multilingual DNN-based feature extraction; multimedia data; speaker adaptive training; speaker i-vectors; video signals; visual features; word error rates; Acoustics; Convolution; Filter banks; Speech; Training; Visualization; speech recognition

Language

English

Publisher

IEEE

Conference_Title

Spoken Language Technology Workshop (SLT), 2014 IEEE

Type

conf

DOI

10.1109/SLT.2014.7078568

Filename

7078568