DocumentCode
1936587
Title
A Statistical Based Part of Speech Tagger for Urdu Language
Author
Anwar, Waqas ; Wang, Xuan ; Li, Lu ; Wang, Xiao-long
Author_Institution
Harbin Inst. of Technol., Harbin
Volume
6
fYear
2007
fDate
19-22 Aug. 2007
Firstpage
3418
Lastpage
3424
Abstract
In this paper we present a pioneering step in designing an n-gram based part-of-speech tagger for the Urdu language. In the last few years, part-of-speech tagging work has been done for English, South Asian, and European languages. Our focus is on the disambiguation problem: assigning the correct tag to every word from a set of possible tags. Our approach employs an n-gram Markov model, trained on an annotated Urdu corpus, that assigns tags to text. In testing, the proposed n-gram part-of-speech tagger achieved state-of-the-art performance of 95.0%. Furthermore, we compare our experimental results on two types of tagset. Along the way, we apply an evaluation method that shows how significant our experimental results are. We also present an error analysis (confusion matrix), a tagging example for Urdu, and an overview of the Urdu language. The contribution of our work is an initial step toward a statistical part-of-speech tagger for Urdu.
Keywords
Markov processes; computational linguistics; natural language processing; speech processing; Urdu language; confusion matrix; disambiguation problem; error analysis; n-gram Markov model; speech tagging; statistical based part of speech tagger; Computer science; Cybernetics; Data mining; Error analysis; Machine learning; Natural languages; Speech analysis; Speech processing; Tagging; Testing; Language model; Part-of-speech tagging; Urdu language;
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Cybernetics, 2007 International Conference on
Conference_Location
Hong Kong
Print_ISBN
978-1-4244-0973-0
Electronic_ISBN
978-1-4244-0973-0
Type
conf
DOI
10.1109/ICMLC.2007.4370739
Filename
4370739