Title :
Probabilistic Speaker Diarization With Bag-of-Words Representations of Speaker Angle Information
Author :
Ishiguro, Katsuhiko ; Yamada, Takeshi ; Araki, Shoko ; Nakatani, Tomohiro ; Sawada, Hiroshi
Author_Institution :
NTT Commun. Sci. Labs., NTT Corp., Kyoto, Japan
Abstract :
Speaker diarization determines “who spoke when” from the recorded conversations of an unknown number of people. In general, we have no a priori information about the number, the locations, or even the characteristics of the speakers. Additionally, speakers' speech utterances vary dynamically because of turn-taking during the conversations. These conditions make the speaker-clustering task extremely difficult. The problem becomes even harder if online (incremental) processing is required. In this paper, we formulate the speaker-clustering problem as the clustering of the sequential audio features generated by an unknown number of latent mixture components (speakers). We employ a probabilistic model that assumes time-sensitive speaker mixtures at every time frame, which, surprisingly, suits the diarization scenario. We combine the time-varying probabilistic model with direction of arrival (DOA) information calculated from a microphone array in a bag-of-words (BoW)-style feature representation. The proposed system effectively estimates the number and locations of the speakers in an online manner based on the standard Bayes inference scheme. Experiments confirm that the proposed model can successfully infer the number and features of speakers and yield better or comparable speaker diarization results compared with conventional methods in several datasets.
Keywords :
audio signal processing; belief networks; direction-of-arrival estimation; feature extraction; inference mechanisms; microphone arrays; pattern clustering; probability; speaker recognition; BoW-style feature representation; DOA information; bag-of-words-style feature representation; direction of arrival information; incremental processing; microphone array; online processing; probabilistic speaker diarization; sequential audio feature clustering; speaker angle information representation; speaker location estimation; speaker-clustering problem; speech utterances; standard Bayes inference scheme; time-sensitive speaker mixtures; time-varying probabilistic model; Computational modeling; Direction of arrival estimation; Feature extraction; Mel frequency cepstral coefficient; Microphones; Probabilistic logic; Speech; Bag-of-words (BOW); clustering; direction of arrival (DOA); latent Dirichlet allocation (LDA); microphone arrays; speaker diarization; variational Bayes inference;
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
DOI :
10.1109/TASL.2011.2151858