Title :
Sample selection for automatic language identification
Author :
Farris, David ; White, Chris ; Khudanpur, Sanjeev
Author_Institution :
Center for Language & Speech Process., Johns Hopkins Univ., Baltimore, MD
Date :
March 31 2008-April 4 2008
Abstract :
Current approaches to automatic spoken language identification (LID) assume the availability of a large corpus of manually language-labeled speech samples for training statistical classifiers. We investigate two methods of active learning to significantly reduce the amount of labeled speech needed for training LID systems. Starting with a small training set, an automated method selects samples from a corpus of unlabeled speech, which are then labeled and added to the training pool. One selection method is based on a previously known entropy criterion, and the other on a novel likelihood-ratio criterion. We demonstrate LID performance comparable to that obtained with a large training corpus using only a tenth of the training data. A further 40% improvement in LID performance is obtained using a third of the training data. Finally, we show that our novel selection method is more robust to variance in the unlabeled pool than the entropy-based method.
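The entropy-based selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the entropy is computed, so this assumes it is the Shannon entropy of the classifier's posterior distribution over languages for each unlabeled sample, and that the samples with the highest entropy (the ones the classifier is least certain about) are sent for labeling. The function names `posterior_entropy` and `select_by_entropy` and the toy posteriors are hypothetical and not taken from the paper. The likelihood-ratio criterion is not sketched because the abstract does not define it.

```python
import math

def posterior_entropy(posteriors):
    """Shannon entropy (in nats) of one sample's posterior over languages."""
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)

def select_by_entropy(pool_posteriors, k):
    """Return indices of the k pool samples with the highest posterior entropy.

    Assumption: high entropy marks the samples the current classifier is
    least sure about, so labeling them is expected to help most.
    """
    ranked = sorted(
        range(len(pool_posteriors)),
        key=lambda i: posterior_entropy(pool_posteriors[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy unlabeled pool: each row is a hypothetical posterior over three languages.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.25],  # uncertain
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
]

chosen = select_by_entropy(pool, 2)
print(chosen)
```

In an active learning loop of the kind the abstract describes, the selected samples would be labeled, added to the training pool, and the classifier retrained before the next round of selection.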
Keywords :
entropy; natural language processing; speech recognition; automatic language identification; entropy criterion; language-labeled speech samples; likelihood-ratio criterion; sample selection; spoken language identification; statistical classifiers; Costs; Error analysis; Iterative algorithms; Iterative methods; Natural languages; Partitioning algorithms; Sampling methods; Speech processing; Training data; Uncertainty; natural languages; speech processing; unsupervised learning;
Conference_Title :
Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4244-1483-3
Electronic_ISBN :
1520-6149
DOI :
10.1109/ICASSP.2008.4518587