• DocumentCode
    1936587
  • Title
    A Statistical Based Part of Speech Tagger for Urdu Language
  • Author
    Anwar, Waqas ; Wang, Xuan ; Li, Lu ; Wang, Xiao-long
  • Author_Institution
    Harbin Inst. of Technol., Harbin
  • Volume
    6
  • fYear
    2007
  • fDate
    19-22 Aug. 2007
  • Firstpage
    3418
  • Lastpage
    3424
  • Abstract
    In this paper we present a pioneering step in designing an n-gram based part-of-speech tagger for the Urdu language. In the last few years, part-of-speech tagging work has been done for English, South Asian, and European languages. Our focus is the disambiguation problem: assigning the correct tag to each word from its set of possible tags. Our approach employs an n-gram Markov model, trained on an annotated Urdu corpus, to assign tags to text. In testing, the proposed n-gram part-of-speech tagger achieved state-of-the-art performance of 95.0%. Furthermore, we report experimental results for two types of tagset. Along the way, we apply an evaluation method that shows how significant our experimental results are. We also present an error analysis (confusion matrix), a worked example of Urdu tagging, and an overview of the Urdu language. The contribution of our work is an initial step toward a statistical part-of-speech tagger for Urdu.
  • Keywords
    Markov processes; computational linguistics; natural language processing; speech processing; Urdu language; confusion matrix; disambiguation problem; error analysis; n-gram Markov model; speech tagging; statistical based part of speech tagger; Computer science; Cybernetics; Data mining; Error analysis; Machine learning; Natural languages; Speech analysis; Speech processing; Tagging; Testing; Language model; Part-of-speech tagging; Urdu language;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2007 International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    978-1-4244-0973-0
  • Electronic_ISBN
    978-1-4244-0973-0
  • Type
    conf
  • DOI
    10.1109/ICMLC.2007.4370739
  • Filename
    4370739