Title :
Using audio and visual cues for speaker diarisation initialisation
Author :
Garau, Giulia ; Bourlard, Hervé
Author_Institution :
Idiap Res. Inst., Martigny, Switzerland
Abstract :
In this paper we present a novel approach to audio-visual speaker diarisation (the task of estimating “who spoke when” using audio and visual cues) in a challenging meeting domain. Our approach initialises agglomerative speaker clustering with psychology-inspired visual features, including Visual Focus of Attention (VFoA) and motion intensities. By providing initial speaker clusters of high purity, this method achieved consistent improvements over the widely adopted linear initialisation method. Moreover, initialisation using both visual and Time Delay of Arrival (TDoA) cues was investigated in conjunction with the multi-stream combination of acoustic and visual features (MFCC, TDoA, VFoA, motion intensity, and head pose likelihoods). This speaker diarisation framework allowed us to successfully integrate three feature streams, further exploiting the complementarity between multimodal cues.
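As a rough illustration of the initialisation idea described in the abstract (not the authors' actual algorithm; the function name, threshold, and data below are invented for this sketch), one might seed the speaker clusters by assigning each frame to the participant with the strongest visual activity:

```python
import numpy as np

def init_clusters_from_visual(motion, threshold=0.5):
    """Assign each frame to the participant with the highest visual
    activity (e.g. motion intensity), yielding initial speaker clusters
    for agglomerative clustering. Frames where no participant exceeds
    `threshold` are left unassigned (-1) for the audio stage to resolve.
    """
    motion = np.asarray(motion)          # shape: (n_frames, n_participants)
    labels = motion.argmax(axis=1)       # most visually active participant
    labels[motion.max(axis=1) < threshold] = -1
    return labels

# Toy example: 6 frames, 2 meeting participants (values are made up).
motion = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.2, 0.9],
    [0.3, 0.2],   # below threshold -> unassigned
    [0.1, 0.8],
])
labels = init_clusters_from_visual(motion)
# labels: [0, 0, 1, 1, -1, 1]
```

The point of such an initialisation is that visually derived clusters start out purer than the arbitrary, equal-length segments produced by linear initialisation, so agglomerative merging has a better starting point.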
Keywords :
pattern clustering; speaker recognition; time-of-arrival estimation; acoustic features; agglomerative speaker clustering; audio cues; head pose likelihoods; linear initialisation method; motion intensities; motion intensity; multimodal cues; psychology inspired visual features; speaker diarisation initialisation; time delay of arrival cues; visual cues; visual focus of attention; Clustering algorithms; Delay effects; Information management; Loudspeakers; Mel frequency cepstral coefficient; Merging; Microphone arrays; Psychology; Speech; Streaming media; Audio Visual speaker diarisation; clustering initialisation;
Conference_Titel :
2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference_Location :
Dallas, TX
Print_ISBN :
978-1-4244-4295-9
Electronic_ISSN :
1520-6149
DOI :
10.1109/ICASSP.2010.5495101