A Statistical Based Part of Speech Tagger for Urdu Language

Author

Anwar, Waqas ; Wang, Xuan ; Li, Lu ; Wang, Xiao-long

Author_Institution

Harbin Inst. of Technol., Harbin

Volume

6

fYear

2007

fDate

19-22 Aug. 2007

Firstpage

3418

Lastpage

3424

Abstract

In this paper we present a pioneering step in designing an n-gram-based part-of-speech tagger for the Urdu language. In the last few years, part-of-speech tagging work has been done for English, South Asian, and European languages. Our focus is the disambiguation problem: assigning the correct tag to every word from a set of possible tags. Our approach employs an n-gram Markov model that is trained on an annotated Urdu corpus and assigns tags to text. The proposed n-gram part-of-speech tagger has been tested and achieved state-of-the-art performance of 95.0%. Furthermore, we compare experimental results for two types of tagset. We also apply an evaluation method that shows how significant our experimental results are. In addition, we present an error analysis (confusion matrix) and show an example of Urdu tagging. We also present an overview of the Urdu language. The contribution of our work is an initial step toward a statistical Urdu part-of-speech tagger.
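The n-gram Markov tagging idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the n-gram order, smoothing method, tagset, or corpus, so everything below is an assumption made for illustration: a bigram (first-order) model, add-one smoothing, Viterbi decoding, and a tiny invented English toy corpus standing in for the annotated Urdu corpus. The names `train_bigram_tagger`, `log_prob`, `viterbi`, and `TOY_CORPUS` are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag


def train_bigram_tagger(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(lambda: defaultdict(int))
    emissions = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, sorted(emissions), vocab


def log_prob(counts, context, event, n_outcomes):
    """Add-one smoothed log P(event | context); an assumed smoothing choice."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence under the bigram model."""
    transitions, emissions, tags, vocab = model
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unseen words
    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions, START, tag, n_tags)
                 + log_prob(emissions, tag, words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        column = {}
        for tag in tags:
            emit = log_prob(emissions, tag, words[i], n_words)
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + log_prob(transitions, p, tag, n_tags))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            column[tag] = (score + emit, prev_tag)
        best.append(column)
    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    path.reverse()
    return path


# Invented toy data, for illustration only (not from the paper's corpus).
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

model = train_bigram_tagger(TOY_CORPUS)
predicted = viterbi(["a", "cat", "barks"], model)
print(predicted)
```

In this sketch, disambiguation comes from combining how likely a tag is to follow the previous tag with how likely the tag is to emit the observed word. A real tagger would likely need a larger context, a stronger smoothing scheme, and handling of unknown words suited to Urdu morphology.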

Keywords

Markov processes; computational linguistics; natural language processing; speech processing; Urdu language; confusion matrix; disambiguation problem; error analysis; n-gram Markov model; speech tagging; statistical based part of speech tagger; Computer science; Cybernetics; Data mining; Machine learning; Natural languages; Speech analysis; Tagging; Testing; Language model; Part-of-speech tagging

fLanguage

English

Publisher

IEEE

Conference_Titel

Machine Learning and Cybernetics, 2007 International Conference on

Conference_Location

Hong Kong

Print_ISBN

978-1-4244-0973-0

Electronic_ISBN

978-1-4244-0973-0

Type

conf

DOI

10.1109/ICMLC.2007.4370739

Filename

4370739