• DocumentCode
    3528142
  • Title

    Multi-modal speaker diarization of real-world meetings using compressed-domain video features

  • Author

    Friedland, Gerald ; Hung, Hayley ; Yeo, Chuohao

  • Author_Institution
    Int. Comput. Sci. Inst., Berkeley, CA
  • fYear
    2009
  • fDate
    19-24 April 2009
  • Firstpage
    4069
  • Lastpage
    4072
  • Abstract
    Speaker diarization is originally defined as the task of determining "who spoke when" given an audio track and no other prior knowledge of any kind. The following article shows a multi-modal approach where we improve a state-of-the-art speaker diarization system by combining standard acoustic features (MFCCs) with compressed-domain video features. The approach is evaluated on over 4.5 hours of the publicly available AMI meetings dataset, which contains challenges such as people standing up and walking out of the room. We show a consistent improvement of about 34% relative in speaker error rate (21% DER) compared to a state-of-the-art audio-only baseline.
  • Keywords
    cepstral analysis; data compression; speaker recognition; video coding; AMI meetings dataset; MFCC; compressed-domain video features; multimodal speaker diarization; real-world meetings; speaker error rate; standard acoustic features; Ambient intelligence; Cameras; Computer science; Density estimation robust algorithm; Error analysis; Legged locomotion; Loudspeakers; Mouth; Speech; Video compression; Speaker extraction; compressed domain features; multi-modal;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on
  • Conference_Location
    Taipei
  • ISSN
    1520-6149
  • Print_ISBN
    978-1-4244-2353-8
  • Electronic_ISBN
    1520-6149
  • Type

    conf

  • DOI
    10.1109/ICASSP.2009.4960522
  • Filename
    4960522