Regional Information Center for Science and Technology (RICeST) - Dialect Classification on Printed Text using Perplexity Measure and Conditional Random Fields

DocumentCode :

2704631

Title :

Dialect Classification on Printed Text using Perplexity Measure and Conditional Random Fields

Author :

Rongqing Huang ; Hansen, John H. L.

Author_Institution :

Erik Jonsson Sch. of Eng. & Comput. Sci., Texas Univ., Dallas, TX, USA

Volume :

fYear :

2007

fDate :

15-20 April 2007

Abstract :

Studies have shown that dialect variation has a significant impact on speech recognition performance, so effective dialect classification is important for improving speech systems. Dialects differ at the acoustic, grammar, and vocabulary levels. In this study, topic-specific printed text dialect data are collected from the ten major newspapers in Australia, the United Kingdom, and the United States. An n-gram language model is trained for each topic in each country/dialect. The perplexity measure is applied to classify the dialect-dependent documents. In addition to the n-gram information, further features can be extracted from text structure. The conditional random field (CRF) is a model that can extract features at different levels while remaining mathematically tractable. The CRF is applied to train the language model and classify documents. Significant improvement in dialect classification is achieved with the CRF-based classifier, especially on small documents (10% to 22% relative error reduction). Text classification based on variable-size documents is explored, and a document of several hundred words is shown to be sufficient for dialect classification. The vocabulary differences among the text documents from different countries are explored, and the dialect difference is smoothly connected with the vocabulary difference. Five document topics are evaluated, and performance for cross-topic dialect classification is explored.
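
The perplexity-based step described in the abstract (one n-gram language model per dialect, with a document assigned to the dialect whose model gives the lowest perplexity) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the bigram order, the add-one smoothing, the names `BigramModel` and `classify`, and the toy training sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each word appears as a context
        self.bigram_counts = Counter()   # how often each (context, word) pair appears
        self.vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, word)] += 1

    def log_prob(self, prev, word):
        # One extra vocabulary slot is reserved for unseen words.
        v = len(self.vocab) + 1
        numerator = self.bigram_counts[(prev, word)] + 1
        denominator = self.context_counts[prev] + v
        return math.log(numerator / denominator)

    def perplexity(self, text):
        # Perplexity = exp of the negative average log-probability per bigram.
        tokens = ["<s>"] + tokenize(text) + ["</s>"]
        pairs = list(zip(tokens, tokens[1:]))
        total = sum(self.log_prob(prev, word) for prev, word in pairs)
        return math.exp(-total / len(pairs))


def classify(models, text):
    """Return the label whose model assigns the lowest perplexity to text."""
    return min(models, key=lambda label: models[label].perplexity(text))


# Hypothetical toy training data, invented for illustration only.
TRAIN = {
    "UK": ["the colour of the lorry was grey",
           "the programme at the theatre was brilliant"],
    "US": ["the color of the truck was gray",
           "the program at the theater was awesome"],
}
MODELS = {label: BigramModel(sentences) for label, sentences in TRAIN.items()}

if __name__ == "__main__":
    print(classify(MODELS, "the colour of the lorry"))
    print(classify(MODELS, "the color of the truck"))
```

With real data, each model would be trained on the newspaper text for one topic and country, and longer test documents would give more stable perplexity estimates, in line with the abstract's observation that several hundred words suffice.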

Keywords :

linguistics; random processes; speech processing; speech recognition; Australia; United Kingdom; United States; conditional random fields; dialect classification; dialect variation; n-gram language model; perplexity measure; text classification; topic-specific printed text dialect data; variable size documents; Acoustic measurements; Computer science; Data mining; Feature extraction; Mathematical model; Natural languages; Robustness; Vocabulary

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on

Conference_Location :

Honolulu, HI

ISSN :

1520-6149

Print_ISBN :

1-4244-0727-3

Type :

conf

DOI :

10.1109/ICASSP.2007.367239

Filename :

4218270

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2704631