Title :
Artificial neural network features for speaker diarization
Author :
Yella, Sree Harsha ; Stolcke, Andreas ; Slaney, Malcolm
Author_Institution :
Microsoft Res., Mountain View, CA, USA
Abstract :
Speaker diarization finds contiguous speaker segments in an audio recording and clusters them by speaker identity, without any a priori knowledge. Diarization is typically based on short-term spectral features such as Mel-frequency cepstral coefficients (MFCCs). Though these features carry average information about the vocal tract characteristics of a speaker, they are also susceptible to factors unrelated to speaker identity. In this study, we propose an artificial neural network (ANN) architecture to learn a feature transform that is optimized for speaker diarization. We train a multi-hidden-layer ANN to judge whether two given speech segments came from the same speaker or from different speakers, using a shared transform of the input features that feeds into a bottleneck layer. We then use the bottleneck layer activations as features for speaker diarization on test data, either alone or in combination with baseline MFCC features in a multistream mode. The resulting system is evaluated on various corpora of multi-party meetings. A combination of MFCC and ANN features gives up to a 14% relative reduction in diarization error, demonstrating that these features provide an additional independent source of knowledge.
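The pairwise architecture described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the layer sizes, the sigmoid nonlinearity, the linear bottleneck, the 19-dimensional input, and all names (`PairNet`, `bottleneck_features`, `same_speaker_prob`) are assumptions made for illustration, and training is omitted. The sketch only shows how one shared transform can serve both segments and how the bottleneck activations can be read out as features.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(layer, x):
    """Compute W x + b for one input vector x."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


class PairNet:
    """Hypothetical same/different-speaker network with a shared bottleneck."""

    def __init__(self, n_in, n_hidden=32, n_bottleneck=8, seed=0):
        rng = random.Random(seed)
        # Shared transform: the same two layers are applied to both segments.
        self.hidden = init_layer(n_in, n_hidden, rng)
        self.bottleneck = init_layer(n_hidden, n_bottleneck, rng)
        # Layers that compare the two bottleneck outputs.
        self.joint = init_layer(2 * n_bottleneck, n_hidden, rng)
        self.out = init_layer(n_hidden, 1, rng)

    def bottleneck_features(self, x):
        """Bottleneck activations, usable as diarization features."""
        h = sigmoid(affine(self.hidden, x))
        return affine(self.bottleneck, h)  # linear bottleneck (assumption)

    def same_speaker_prob(self, x_a, x_b):
        """Score in (0, 1) that both inputs come from the same speaker."""
        z = self.bottleneck_features(x_a) + self.bottleneck_features(x_b)
        h = sigmoid(affine(self.joint, z))
        return sigmoid(affine(self.out, h))[0]


if __name__ == "__main__":
    rng = random.Random(1)
    seg_a = [rng.gauss(0.0, 1.0) for _ in range(19)]  # stand-in feature vector
    seg_b = [rng.gauss(0.0, 1.0) for _ in range(19)]
    net = PairNet(n_in=19)
    print(len(net.bottleneck_features(seg_a)), net.same_speaker_prob(seg_a, seg_b))
```

With untrained random weights the score carries no meaning. In a trained network of this shape, `bottleneck_features` would be applied to the test data on its own, and the comparison layers would be discarded.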
Keywords :
neural nets; speaker recognition; transforms; MFCC; artificial neural network architecture; artificial neural network features; audio recording; contiguous speaker segments; feature transform; mel-frequency cepstral coefficients; multihidden-layer ANN; multiparty meetings; multistream mode; shared transform; short-term spectral features; speaker diarization; speaker identity; speech segments; vocal tract characteristics; Abstracts; Artificial neural networks; Density estimation robust algorithm; ISO standards; Mel frequency cepstral coefficient; Noise; discriminative feature extraction
Conference_Title :
Spoken Language Technology Workshop (SLT), 2014 IEEE
DOI :
10.1109/SLT.2014.7078608