DocumentCode
3585061
Title
Artificial neural network features for speaker diarization
Author
Yella, Sree Harsha ; Stolcke, Andreas ; Slaney, Malcolm
Author_Institution
Microsoft Res., Mountain View, CA, USA
fYear
2014
Firstpage
402
Lastpage
406
Abstract
Speaker diarization finds contiguous speaker segments in an audio recording and clusters them by speaker identity, without any a priori knowledge. Diarization is typically based on short-term spectral features such as Mel-frequency cepstral coefficients (MFCCs). Though these features carry average information about the vocal tract characteristics of a speaker, they are also susceptible to factors unrelated to speaker identity. In this study, we propose an artificial neural network (ANN) architecture to learn a feature transform that is optimized for speaker diarization. We train a multi-hidden-layer ANN to judge whether two given speech segments come from the same speaker or from different speakers, using a shared transform of the input features that feeds into a bottleneck layer. We then use the bottleneck layer activations as features for speaker diarization on test data, either alone or in combination with baseline MFCC features in a multistream mode. The resulting system is evaluated on various corpora of multi-party meetings. A combination of MFCC and ANN features gives up to a 14% relative reduction in diarization error, demonstrating that the ANN features provide an additional, independent source of knowledge.
Keywords
neural nets; speaker recognition; transforms; MFCC; artificial neural network architecture; artificial neural network features; audio recording; contiguous speaker segments; feature transform; mel-frequency cepstral coefficients; multihidden-layer ANN; multiparty meetings; multistream mode; shared transform; short-term spectral features; speaker diarization; speaker identity; speech segments; vocal tract characteristics; Abstracts; Artificial neural networks; Density estimation robust algorithm; ISO standards; Mel frequency cepstral coefficient; Noise; discriminative feature extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Spoken Language Technology Workshop (SLT), 2014 IEEE
Type
conf
DOI
10.1109/SLT.2014.7078608
Filename
7078608