Title :
Phonetic speaker recognition using maximum-likelihood binary-decision tree models
Author :
Navrátil, Jiri ; Jin, Qin ; Andrews, Walter D. ; Campbell, Joseph P.
Author_Institution :
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Abstract :
Recent work in phonetic speaker recognition has shown that modeling phone sequences using n-grams is a viable and effective approach to speaker recognition, primarily aiming at capturing speaker-dependent pronunciation and also word usage. The paper describes a method involving binary-tree-structured statistical models for extending the phonetic context beyond that of standard n-grams (particularly bigrams) by exploiting statistical dependencies within a longer sequence window without exponentially increasing the model complexity, as is the case with n-grams. Two ways of dealing with data sparsity are also studied; namely, model adaptation and a recursive bottom-up smoothing of symbol distributions. Results obtained under a variety of experimental conditions using the NIST 2001 Speaker Recognition Extended Data Task indicate consistent improvements in equal-error rate performance as compared to standard bigram models. The described approach confirms the relevance of long phonetic context in phonetic speaker recognition and represents an intermediate stage between short phone context and word-level modeling without the need for any lexical knowledge, which suggests its language independence.
Keywords :
computational complexity; decision trees; error statistics; smoothing methods; speaker recognition; speech processing; statistical analysis; NIST 2001 Speaker Recognition Extended Data Task; NIST Extended Data Task; bigrams; binary-decision tree structure; binary-tree-structured statistical models; data sparsity; equal-error rate; lexical knowledge; maximum-likelihood binary-decision tree models; n-grams; phone sequences; phonetic speaker recognition; recursive bottom-up smoothing; recursive smoothing; speaker-dependent pronunciation; symbol distributions; word usage; Adaptation model; Automatic speech recognition; Context modeling; Laboratories; Loudspeakers; NIST; Natural languages; Smoothing methods; Speaker recognition; Training data
Conference_Title :
2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Proceedings
Print_ISBN :
0-7803-7663-3
DOI :
10.1109/ICASSP.2003.1202763