• DocumentCode
    417247
  • Title
    Advances in unsupervised audio segmentation for the broadcast news and NGSW corpora
  • Author
    Huang, Rongqing ; Hansen, John H L
  • Author_Institution
    Center for Spoken Language Research, University of Colorado, Boulder, CO, USA
  • Volume
    1
  • fYear
    2004
  • fDate
    17-21 May 2004
  • Abstract
    Unsupervised audio segmentation remains a challenging research problem that significantly impacts automatic speech recognition (ASR) and spoken document retrieval (SDR) performance. This paper presents novel advances in audio segmentation for unsupervised multi-speaker change detection. First, we investigate new features intended to be more appropriate for segmentation: PMVDR (perceptual minimum variance distortionless response), SZCR (smoothed zero crossing rate), and FBLC (filterbank log coefficients). Next, we consider a new distance metric, T2-mean, which is intended to improve segmentation for short segments (< 5 s). A novel false alarm compensation procedure is also developed and applied after the segmentation phase. We establish an evaluation procedure for segmentation that is more effective than the traditional EER (equal error rate) and frame accuracy approaches. Employing these advances within our new scheme results in more than a 30% improvement in segmentation performance on the 3-hour Hub4 1997 broadcast news evaluation data. Evaluations are also presented for audio from the NGSW (National Gallery of the Spoken Word) corpus.
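    The record names a T2-mean distance but does not define it. One common reading is a two-sample Hotelling T² statistic computed on the mean feature vectors of two adjacent windows; the block below is a minimal pure-Python sketch under that assumption, using a pooled within-segment covariance and a sliding candidate boundary. The function names (`hotelling_t2`, `best_split`), the `min_len` parameter, and the toy one-dimensional frames are hypothetical illustrations, not the paper's implementation; real inputs would be frames of features such as PMVDR or FBLC.

```python
"""Hypothetical sketch: two-sample Hotelling T^2 distance for change detection.

Assumption: 'T2-mean' is read here as the textbook two-sample Hotelling T^2
with a pooled within-segment covariance; the paper's exact form may differ.
"""
from typing import List, Sequence, Tuple

Frame = Sequence[float]
Matrix = List[List[float]]


def mean_vector(frames: Sequence[Frame]) -> List[float]:
    """Component-wise mean of a list of feature frames."""
    n, d = len(frames), len(frames[0])
    return [sum(f[k] for f in frames) / n for k in range(d)]


def scatter_matrix(frames: Sequence[Frame], mu: Sequence[float]) -> Matrix:
    """Sum of outer products of deviations from the mean (unnormalized covariance)."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for f in frames:
        dev = [f[k] - mu[k] for k in range(d)]
        for i in range(d):
            for j in range(d):
                s[i][j] += dev[i] * dev[j]
    return s


def solve(a: Matrix, b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / m[i][i]
    return x


def hotelling_t2(seg_a: Sequence[Frame], seg_b: Sequence[Frame]) -> float:
    """T^2 = (a*b/(a+b)) * (mu_a - mu_b)^T S_p^{-1} (mu_a - mu_b)."""
    a, b = len(seg_a), len(seg_b)
    mu_a, mu_b = mean_vector(seg_a), mean_vector(seg_b)
    s_a, s_b = scatter_matrix(seg_a, mu_a), scatter_matrix(seg_b, mu_b)
    d = len(mu_a)
    # Pooled within-segment covariance: assumes both sides share one covariance.
    pooled = [[(s_a[i][j] + s_b[i][j]) / (a + b - 2) for j in range(d)]
              for i in range(d)]
    diff = [mu_a[k] - mu_b[k] for k in range(d)]
    w = solve(pooled, diff)  # w = S_p^{-1} (mu_a - mu_b)
    return (a * b / (a + b)) * sum(diff[k] * w[k] for k in range(d))


def best_split(frames: Sequence[Frame], min_len: int) -> Tuple[int, float]:
    """Return (index, score) of the candidate boundary with the largest T^2."""
    best = (-1, float("-inf"))
    for k in range(min_len, len(frames) - min_len + 1):
        try:
            score = hotelling_t2(frames[:k], frames[k:])
        except ValueError:
            continue  # skip boundaries where the covariance cannot be inverted
        if score > best[1]:
            best = (k, score)
    return best


if __name__ == "__main__":
    # Toy 1-D "feature" stream with an abrupt mean shift after frame 4.
    toy = [[0.0], [0.1], [-0.1], [0.05], [5.0], [5.1], [4.9], [5.05]]
    print(best_split(toy, min_len=2))
```

    On the toy stream the largest score falls at boundary index 4, where the mean shift occurs. A plausible design motivation, offered as an interpretation rather than a claim from the record: because this statistic compares only means under one shared covariance, it estimates fewer parameters per side than a criterion that fits a separate full covariance to each segment, which may help when segments are short.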
  • Keywords
    information retrieval; smoothing methods; speech recognition; ASR; FBLC; Hub4 broadcast news 1997; NGSW corpora; PMVDR; SDR performance; SZCR; T2-mean; automatic speech recognition; distance metric; false alarm compensation; filterbank log coefficients; multi-speaker change detection; perceptual minimum variance distortionless response; smoothed zero crossing rate; spoken document retrieval; unsupervised audio segmentation; Acoustic distortion; Automatic speech recognition; Bayesian methods; Broadcasting; Loudspeakers; Mel frequency cepstral coefficient; Natural languages; Robustness; Speech processing; Streaming media;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on
  • ISSN
    1520-6149
  • Print_ISBN
    0-7803-8484-9
  • Type
    conf
  • DOI
    10.1109/ICASSP.2004.1326092
  • Filename
    1326092