Artificial neural network features for speaker diarization

Author

Yella, Sree Harsha ; Stolcke, Andreas ; Slaney, Malcolm

Author_Institution

Microsoft Research, Mountain View, CA, USA

Year

2014

Firstpage

402

Lastpage

406

Abstract

Speaker diarization finds contiguous speaker segments in an audio recording and clusters them by speaker identity, without any a priori knowledge. Diarization is typically based on short-term spectral features such as Mel-frequency cepstral coefficients (MFCCs). Though these features carry average information about the vocal tract characteristics of a speaker, they are also susceptible to factors unrelated to speaker identity. In this study, we propose an artificial neural network (ANN) architecture to learn a feature transform that is optimized for speaker diarization. We train a multi-hidden-layer ANN to judge whether two given speech segments come from the same speaker or from different speakers, using a shared transform of the input features that feeds into a bottleneck layer. We then use the bottleneck layer activations as features for speaker diarization on test data, either alone or combined with baseline MFCC features in a multistream mode. The resulting system is evaluated on various corpora of multi-party meetings. A combination of MFCC and ANN features gives up to a 14% relative reduction in diarization error, demonstrating that the ANN features provide an additional, independent source of knowledge.
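
The pairwise architecture described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the layer sizes, the tanh and sigmoid activations, the random untrained weights, the representation of each segment as a single fixed-length vector, and the names PairNet, bottleneck_features and same_speaker_score. The sketch only shows the forward pass: the same (shared) transform is applied to both segments, the two bottleneck outputs are joined to score "same speaker", and the bottleneck activations can be read out as features.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create a dense layer with small random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x, activation):
    """Apply one dense layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class PairNet:
    """Hypothetical two-input network with a shared bottleneck transform."""

    def __init__(self, n_in=39, n_hidden=16, n_bottleneck=8, seed=0):
        rng = random.Random(seed)
        # Shared transform: the same weights process both input segments.
        self.hidden = make_layer(n_in, n_hidden, rng)
        self.bottleneck = make_layer(n_hidden, n_bottleneck, rng)
        # Layers that compare the two bottleneck outputs.
        self.merge = make_layer(2 * n_bottleneck, n_hidden, rng)
        self.out = make_layer(n_hidden, 1, rng)

    def bottleneck_features(self, x):
        """Bottleneck activations for one segment (usable as features)."""
        h = dense(self.hidden, x, math.tanh)
        return dense(self.bottleneck, h, math.tanh)

    def same_speaker_score(self, seg_a, seg_b):
        """Score in (0, 1); with untrained weights it carries no meaning."""
        joint = self.bottleneck_features(seg_a) + self.bottleneck_features(seg_b)
        h = dense(self.merge, joint, math.tanh)
        return dense(self.out, h, sigmoid)[0]


# Usage with synthetic 39-dimensional inputs (an assumed feature size).
rng = random.Random(1)
seg_a = [rng.gauss(0.0, 1.0) for _ in range(39)]
seg_b = [rng.gauss(0.0, 1.0) for _ in range(39)]
net = PairNet()
features = net.bottleneck_features(seg_a)
score = net.same_speaker_score(seg_a, seg_b)
```

In a trained system of this kind, the weights would be fitted on labeled same-speaker and different-speaker pairs, and only bottleneck_features would be needed at test time; the training procedure is omitted here because the abstract does not specify it.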

Keywords

neural nets; speaker recognition; transforms; MFCC; artificial neural network architecture; artificial neural network features; audio recording; contiguous speaker segments; feature transform; mel-frequency cepstral coefficients; multihidden-layer ANN; multiparty meetings; multistream mode; shared transform; short-term spectral features; speaker diarization; speaker identity; speech segments; vocal tract characteristics; Abstracts; Artificial neural networks; Density estimation robust algorithm; ISO standards; Mel frequency cepstral coefficient; Noise; discriminative feature extraction

Language

English

Publisher

IEEE

Conference_Title

Spoken Language Technology Workshop (SLT), 2014 IEEE

Type

conf

DOI

10.1109/SLT.2014.7078608

Filename

7078608