Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora

Author

Huang, Rongqing ; Hansen, John H L

Volume

Issue

fYear

2006

fDate

5/1/2006

Firstpage

907

Lastpage

919

Abstract

Unsupervised audio classification and segmentation remains a challenging research problem that significantly affects automatic speech recognition (ASR) and spoken document retrieval (SDR) performance. This paper presents advances in 1) audio classification for speech recognition and 2) audio segmentation for unsupervised multispeaker change detection. A new audio classification algorithm based on weighted GMM networks (WGN) is proposed. Two new extended-time features, variance of the spectrum flux (VSF) and variance of the zero-crossing rate (VZCR), are used to preclassify the audio and to supply weights to the output probabilities of the GMM networks; classification is then performed with the weighted GMM networks. Because no features have historically been designed specifically for audio segmentation, we evaluate 16 candidate features, including three newly proposed features (perceptual minimum variance distortionless response (PMVDR), smoothed zero-crossing rate (SZCR), and filterbank log energy coefficients (FBLC)), in 14 noisy environments to determine which features are most robust on average across these conditions. Next, a new distance metric, T²-mean, is proposed to improve segmentation for short segment turns (i.e., 1-5 s). A new false alarm compensation procedure is also implemented, which reduces the false alarm rate significantly at little cost to the miss rate. Evaluations on a standard data set, the Defense Advanced Research Projects Agency (DARPA) Hub4 Broadcast News 1997 evaluation data, show that the WGN classification algorithm achieves over a 50% improvement versus the GMM network baseline algorithm, and that the proposed compound segmentation algorithm achieves a 23%-10% improvement in all metrics versus the baseline of Mel-frequency cepstral coefficients (MFCC) with the traditional Bayesian information criterion (BIC) algorithm.
The new classification and segmentation algorithms also obtain very satisfactory results on the more diverse and challenging National Gallery of the Spoken Word (NGSW) corpus.
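As an illustration of the extended-time features named in the abstract, the following is a minimal sketch of a variance-of-zero-crossing-rate computation. The abstract does not define VZCR, so this assumes the plain reading of the name: compute the zero-crossing rate of each short frame, then take the variance of those rates over a longer window. The function names, the frame length, and the hop size are hypothetical choices for illustration, not values taken from the paper.

```python
from statistics import pvariance


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def variance_of_zcr(samples, frame_len=400, hop=160):
    """Variance of per-frame ZCR over one long analysis window.

    frame_len and hop are assumed values (25 ms and 10 ms at 16 kHz),
    chosen only for illustration.
    """
    rates = [
        zero_crossing_rate(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]
    # The variance of fewer than two frame rates is taken to be zero.
    return pvariance(rates) if len(rates) > 1 else 0.0
```

A signal whose frame-level ZCR stays constant yields a variance of zero, while a signal that switches character partway through the window yields a positive value. Speech generally alternates between voiced and unvoiced sounds, so a feature of this kind would be expected to run higher for speech than for steadier audio such as music, which is presumably what makes it usable for preclassification.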

Keywords

Gaussian processes; audio signal processing; broadcasting; channel bank filters; information retrieval; speaker recognition; speech processing; unsupervised learning; Gaussian mixture model; National Gallery of the Spoken Word corpus; audio segmentation; automatic speech recognition; broadcast news; filterbank log energy coefficients; perceptual minimum variance distortionless response; smoothed zero-crossing rate; spoken document retrieval; unsupervised audio classification; unsupervised multispeaker change detection; variance of the spectrum flux; variance of the zero-crossing rate; Cepstral analysis; Change detection algorithms; Classification algorithms; Costs; Filter bank; Robustness; Speech recognition; Working environment noise; Audio classification; Bayesian information criterion; Gaussian mixture model (GMM) networks; broadcast news transcription; feature analysis; feature processing; noisy environments; rich transcription; speaker segmentation

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE Transactions on

Publisher

IEEE

ISSN

1558-7916

Type

jour

DOI

10.1109/TSA.2005.858057

Filename

1621203

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=900333