DocumentCode :
2704631
Title :
Dialect Classification on Printed Text using Perplexity Measure and Conditional Random Fields
Author :
Huang, Rongqing ; Hansen, John H. L.
Author_Institution :
Erik Jonsson Sch. of Eng. & Comput. Sci., Texas Univ., Dallas, TX, USA
Volume :
4
fYear :
2007
fDate :
15-20 April 2007
Abstract :
Studies have shown that dialect variation has a significant impact on speech recognition performance, so effective dialect classification is important for improving speech systems. Dialects differ at the acoustic, grammar, and vocabulary levels. In this study, topic-specific printed text dialect data are collected from the ten major newspapers in Australia, the United Kingdom, and the United States. An n-gram language model is trained for each topic in each country/dialect. The perplexity measure is applied to classify the dialect-dependent documents. In addition to the n-gram information, further features can be extracted from text structure. The conditional random field (CRF) is a model that can extract different levels of features while remaining mathematically tractable. The CRF is applied to train the language model and classify documents. Significant improvement in dialect classification is achieved by using the CRF-based classifier, especially on small documents (10% to 22% relative error reduction). Text classification based on variable-size documents is explored, and a document of several hundred words is shown to be sufficient for dialect classification. The vocabulary differences among the text documents from different countries are explored, and the dialect difference is smoothly connected with the vocabulary difference. Five document topics are evaluated, and performance for cross-topic dialect classification is explored.
Keywords :
linguistics; random processes; speech processing; speech recognition; Australia; United Kingdom; United States; conditional random fields; dialect classification; dialect variation; n-gram language model; perplexity measure; speech recognition; text classification; topic-specific printed text dialect data; variable size documents; Acoustic measurements; Australia; Computer science; Data mining; Feature extraction; Mathematical model; Natural languages; Robustness; Speech recognition; Vocabulary; Conditional Random Fields; Dialect classification; n-gram language model; text classification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on
Conference_Location :
Honolulu, HI
ISSN :
1520-6149
Print_ISBN :
1-4244-0727-3
Type :
conf
DOI :
10.1109/ICASSP.2007.367239
Filename :
4218270