  • DocumentCode
    3585061
  • Title
    Artificial neural network features for speaker diarization
  • Author
    Yella, Sree Harsha ; Stolcke, Andreas ; Slaney, Malcolm
  • Author_Institution
    Microsoft Res., Mountain View, CA, USA
  • fYear
    2014
  • Firstpage
    402
  • Lastpage
    406
  • Abstract
    Speaker diarization finds contiguous speaker segments in an audio recording and clusters them by speaker identity, without any a priori knowledge. Diarization is typically based on short-term spectral features such as Mel-frequency cepstral coefficients (MFCCs). Though these features carry average information about the vocal tract characteristics of a speaker, they are also susceptible to factors unrelated to speaker identity. In this study, we propose an artificial neural network (ANN) architecture to learn a feature transform that is optimized for speaker diarization. We train a multi-hidden-layer ANN to judge whether two given speech segments came from the same speaker or from different speakers, using a shared transform of the input features that feeds into a bottleneck layer. We then use the bottleneck layer activations as features for speaker diarization on test data, either alone or combined with baseline MFCC features in a multistream mode. The resulting system is evaluated on various corpora of multi-party meetings. A combination of MFCC and ANN features gives up to a 14% relative reduction in diarization error, demonstrating that the ANN features provide an additional, independent source of knowledge.
  • Keywords
    neural nets; speaker recognition; transforms; MFCC; artificial neural network architecture; artificial neural network features; audio recording; contiguous speaker segments; feature transform; mel-frequency cepstral coefficients; multihidden-layer ANN; multiparty meetings; multistream mode; shared transform; short-term spectral features; speaker diarization; speaker identity; speech segments; vocal tract characteristics; Abstracts; Artificial neural networks; Density estimation robust algorithm; ISO standards; Mel frequency cepstral coefficient; Noise; artificial neural networks; discriminative feature extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Spoken Language Technology Workshop (SLT), 2014 IEEE
  • Type
    conf
  • DOI
    10.1109/SLT.2014.7078608
  • Filename
    7078608
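
The architecture described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `PairNet`, the layer sizes, the activation functions, and the input dimension of 19 are all hypothetical choices made for illustration, and the weights are random rather than trained. The sketch only shows the structure the abstract names: one shared transform applied to both segments, a bottleneck layer whose activations can be read out as features, and a same/different-speaker output computed from the pair.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create a dense layer with small random weights and zero biases (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x, activation):
    """Apply a dense layer followed by an elementwise activation."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


class PairNet:
    """Hypothetical same/different-speaker network with a shared bottleneck transform."""

    def __init__(self, n_in, n_hidden, n_bottleneck, seed=0):
        rng = random.Random(seed)
        # Shared transform: the same two layers are applied to both input segments.
        self.hidden = make_layer(n_in, n_hidden, rng)
        self.bottleneck = make_layer(n_hidden, n_bottleneck, rng)
        # Comparison layers operate on the two bottleneck outputs joined together.
        self.compare = make_layer(2 * n_bottleneck, n_hidden, rng)
        self.output = make_layer(n_hidden, 1, rng)

    def bottleneck_features(self, segment):
        """Return bottleneck activations for one segment (the features used downstream)."""
        h = dense(self.hidden, segment, math.tanh)
        return dense(self.bottleneck, h, identity)

    def same_speaker_score(self, seg_a, seg_b):
        """Return a value in (0, 1); after training it would estimate P(same speaker)."""
        joint = self.bottleneck_features(seg_a) + self.bottleneck_features(seg_b)
        h = dense(self.compare, joint, math.tanh)
        return dense(self.output, h, sigmoid)[0]


# Usage with made-up input vectors standing in for per-segment spectral features.
data_rng = random.Random(1)
seg_a = [data_rng.gauss(0.0, 1.0) for _ in range(19)]
seg_b = [data_rng.gauss(0.0, 1.0) for _ in range(19)]

net = PairNet(n_in=19, n_hidden=16, n_bottleneck=8)
features_a = net.bottleneck_features(seg_a)
score = net.same_speaker_score(seg_a, seg_b)
print(len(features_a), score)
```

In a trained system of this kind, only `bottleneck_features` would be needed at test time: its output would replace or accompany the MFCC stream in the diarization system, while the comparison layers serve only to define the training objective.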