• DocumentCode
    3539775
  • Title

    Learning a stochastic part of speech tagger for Sinhala

  • Author

    Jayasuriya, M. ; Weerasinghe, A.R.

  • Author_Institution
    Virtusa (Pvt) Ltd., Colombo, Sri Lanka
  • fYear
    2013
  • fDate
    11-15 Dec. 2013
  • Firstpage
    137
  • Lastpage
    143
  • Abstract
    This paper presents the results of developing a part-of-speech (POS) tagger for Sinhala. The tagger handles lexical items with multiple POS tags and also predicts POS tags for previously unseen words. A stochastic approach, a Hidden Markov Model (HMM) with tri-gram probabilities, was used as the training and tagging model. Linear interpolation is used to smooth the tri-gram probabilities, while the Viterbi algorithm decodes the HMM to select the best POS tag for each word. The tagger learns the lexical items (words and their possible POS tags) and the tri-gram probabilities from a POS-tag-annotated corpus. The tagger achieved an overall accuracy of 62%. Approximately 24% of the errors were for words whose POS tags were unknown in the corpus. The lack of a named entity recognizer also contributed 10% of the overall error.
  • Keywords
    hidden Markov models; interpolation; learning (artificial intelligence); natural language processing; speech recognition; HMM; POS tagger; Sinhala language; Viterbi algorithm; hidden Markov model; learning; lexical items; linear interpolation; named entity recognizer; part-of-speech tagger; stochastic approach; tagging model; training model; tri-gram probabilities; Accuracy; Probability; Speech; Stochastic processes; Tagging; Training; Part of speech tagging
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on
  • Conference_Location
    Colombo
  • Print_ISBN
    978-1-4799-1275-9
  • Type

    conf

  • DOI
    10.1109/ICTer.2013.6761168
  • Filename
    6761168
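
The abstract mentions smoothing tri-gram probabilities with linear interpolation. As a rough illustration only, the sketch below mixes unigram, bigram, and tri-gram tag estimates with fixed weights. The function names (`train_counts`, `interpolated_prob`), the `<s>` start symbol, and the weights `(0.1, 0.3, 0.6)` are assumptions made for this example; the record does not state how the paper chose its weights or structured its counts.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start padding symbol


def train_counts(tagged_sentences):
    """Collect tag n-gram counts from sentences given as lists of (word, tag) pairs."""
    counts = {name: Counter() for name in ("uni", "bi", "tri", "ctx1", "ctx2")}
    for sentence in tagged_sentences:
        tags = [START, START] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            t1, t2, t3 = tags[i - 2], tags[i - 1], tags[i]
            counts["uni"][t3] += 1
            counts["bi"][(t2, t3)] += 1
            counts["tri"][(t1, t2, t3)] += 1
            counts["ctx1"][t2] += 1        # times t2 was seen as a bigram context
            counts["ctx2"][(t1, t2)] += 1  # times (t1, t2) was seen as a tri-gram context
    return counts


def interpolated_prob(counts, t1, t2, t3, lambdas=(0.1, 0.3, 0.6)):
    """Estimate P(t3 | t1, t2) as a weighted sum of unigram, bigram, and tri-gram estimates.

    The weights are illustrative and must sum to 1; in practice they would be
    tuned on held-out data.
    """
    l1, l2, l3 = lambdas
    total = sum(counts["uni"].values())
    p_uni = counts["uni"][t3] / total if total else 0.0
    ctx1 = counts["ctx1"][t2]
    p_bi = counts["bi"][(t2, t3)] / ctx1 if ctx1 else 0.0
    ctx2 = counts["ctx2"][(t1, t2)]
    p_tri = counts["tri"][(t1, t2, t3)] / ctx2 if ctx2 else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


if __name__ == "__main__":
    # Toy corpus with made-up words and tags, not data from the paper.
    corpus = [[("w1", "N"), ("w2", "V")], [("w3", "N"), ("w4", "N")]]
    counts = train_counts(corpus)
    print(interpolated_prob(counts, START, "N", "V"))
    print(interpolated_prob(counts, "V", "N", "V"))  # tri-gram context never observed
```

Because the lower-order estimates still contribute, a tag sequence whose tri-gram was never observed in training receives a non-zero probability, which is the property a Viterbi decoder relies on when scoring unseen contexts.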