Title :
Using audio and visual cues for speaker diarisation initialisation
Author :
Garau, Giulia ; Bourlard, Hervé
Author_Institution :
Idiap Res. Inst., Martigny, Switzerland
Abstract :
In this paper we present a novel approach to audio-visual speaker diarisation (the task of estimating “who spoke when” using audio and visual cues) in a challenging meeting domain. Our approach initialises agglomerative speaker clustering with psychology-inspired visual features, including Visual Focus of Attention (VFoA) and motion intensities. By providing initial speaker clusters of high purity, this method achieved consistent improvements over the widely adopted linear initialisation method. Moreover, initialisation using both visual and Time Delay of Arrival (TDoA) cues was investigated in conjunction with the multi-stream combination of acoustic and visual features (MFCC, TDoA, VFoA, motion intensity, and head pose likelihoods). This speaker diarisation framework allowed us to successfully integrate three feature streams, further exploiting the complementarity between multimodal cues.
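As a rough illustration of the initialisation idea described in the abstract (not the authors' actual algorithm; the function name, threshold, and data below are invented for this sketch), one might seed the speaker clusters by assigning each frame to the participant with the strongest visual activity:

```python
import numpy as np

def init_clusters_from_visual(motion, threshold=0.5):
    """Assign each frame to the participant with the highest visual
    activity (e.g. motion intensity), yielding initial speaker clusters
    for agglomerative clustering. Frames where no participant exceeds
    `threshold` are left unassigned (-1) for the audio stage to resolve.
    """
    motion = np.asarray(motion)          # shape: (n_frames, n_participants)
    labels = motion.argmax(axis=1)       # most visually active participant
    labels[motion.max(axis=1) < threshold] = -1
    return labels

# Toy example: 6 frames, 2 meeting participants (values are made up).
motion = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.2, 0.9],
    [0.3, 0.2],   # below threshold -> unassigned
    [0.1, 0.8],
])
labels = init_clusters_from_visual(motion)
# labels: [0, 0, 1, 1, -1, 1]
```

The point of such an initialisation is that visually derived clusters start out purer than the arbitrary, equal-length segments produced by linear initialisation, so agglomerative merging has a better starting point.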
Keywords :
pattern clustering; speaker recognition; time-of-arrival estimation; acoustic features; agglomerative speaker clustering; audio cues; head pose likelihoods; linear initialisation method; motion intensities; motion intensity; multimodal cues; psychology inspired visual features; speaker diarisation initialisation; time delay of arrival cues; visual cues; visual focus of attention; Clustering algorithms; Delay effects; Information management; Loudspeakers; Mel frequency cepstral coefficient; Merging; Microphone arrays; Psychology; Speech; Streaming media; Audio Visual speaker diarisation; clustering initialisation;
Conference_Titel :
2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference_Location :
Dallas, TX
Print_ISBN :
978-1-4244-4295-9
Electronic_ISSN :
1520-6149
DOI :
10.1109/ICASSP.2010.5495101