A hybrid Parts Of Speech tagger for Malayalam language

Author

Anisha Aziz T; Sunitha C

Author_Institution

Department of Computer Science and Engg., VAST, Thalakkottukara, Thrissur, Kerala, India

fYear

2015

Firstpage

1502

Lastpage

1507

Abstract

Parts of speech (POS) tagging is an important research topic in the Natural Language Processing (NLP) research area. Since it is one of the first steps of NLP techniques such as machine translation, any tagging error propagates through the whole NLP process. Previous work on POS tagging has been based on SVM, MBLP, HMM, and n-grams. None of these methods resolved the problem of ambiguity. To address ambiguity, we put forward a new hybrid tagger for Malayalam. Combining traditional rules with n-grams may produce better results than other methodologies, and ambiguity is further reduced by enriching the bigram dictionary. A bigram dictionary of co-occurring words is built together with their tags; it contains roughly 100,000 or more words. A Malayalam corpus, which the model accesses, must also be built. It contains about 100,000 words, comprising Malayalam words as well as words that originated from English. Since it is a hybrid tagger, it can take advantage of both traditional rules and bigrams. The heart of the research is the rule set, which contains 267 manually created rules. Rules are applied with the help of a morph analyzer, and they are also used for tagging when neither the bigram dictionary nor the corpus can be consulted. When the proposed method was tested on 150 words, only 11 words were not identified, and it obtained 90.5% accuracy. A word can go unidentified because its root word is missing from the corpus or the bigram dictionary, or because no applicable rule exists. Adding the missing word, bigram, or rule therefore improves the result and extends the work, and such additions are simple. The sizes of the bigram dictionary, corpus, and rule set, together with the accuracy of the morph analyzer, influence the performance of the system.
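
The lookup order described in the abstract (bigram dictionary first, then corpus, then rules as a fallback) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `BIGRAMS`, `CORPUS`, `SUFFIX_RULES`, `tag_by_rule`, and `tag_sentence` are hypothetical, the entries are English placeholder words rather than the paper's Malayalam data, and simple suffix matching stands in for the 267 manual rules and the morph analyzer. It is not the authors' implementation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy resources; the paper's real dictionary and corpus
# each hold on the order of 100,000 entries.
BIGRAMS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("the", "book"): ("DET", "NOUN"),
}
CORPUS: Dict[str, str] = {"she": "PRON", "reads": "VERB"}

# Suffix rules standing in for manually written rules applied
# after morphological analysis (an assumption, not the paper's rules).
SUFFIX_RULES: List[Tuple[str, str]] = [("ly", "ADV"), ("ing", "VERB")]


def tag_by_rule(word: str) -> Optional[str]:
    """Return the tag of the first suffix rule that matches, else None."""
    for suffix, tag in SUFFIX_RULES:
        if len(word) > len(suffix) and word.endswith(suffix):
            return tag
    return None


def tag_sentence(words: List[str]) -> List[Tuple[str, str]]:
    """Tag words using bigrams first, then the corpus, then rules."""
    tags: List[Optional[str]] = [None] * len(words)

    # Pass 1: tag both members of every adjacent pair found in the
    # bigram dictionary, without overwriting an earlier decision.
    for i in range(len(words) - 1):
        pair = BIGRAMS.get((words[i], words[i + 1]))
        if pair is not None:
            if tags[i] is None:
                tags[i] = pair[0]
            if tags[i + 1] is None:
                tags[i + 1] = pair[1]

    # Pass 2: fall back to the corpus, then to the rules; words that
    # nothing covers are marked "UNK" (unidentified).
    for i, word in enumerate(words):
        if tags[i] is None:
            tags[i] = CORPUS.get(word) or tag_by_rule(word) or "UNK"

    return [(w, t or "UNK") for w, t in zip(words, tags)]
```

In this sketch a word left as `UNK` corresponds to the unidentified words mentioned in the abstract: adding the word to `CORPUS`, a pair to `BIGRAMS`, or a rule to `SUFFIX_RULES` would let it be tagged.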

Keywords

"Tagging","Dictionaries","Speech","Training","Natural language processing","Accuracy","Hidden Markov models"

Publisher

ieee

Conference_Titel

Advances in Computing, Communications and Informatics (ICACCI), 2015 International Conference on

Print_ISBN

978-1-4799-8790-0

Type

conf

DOI

10.1109/ICACCI.2015.7275825

Filename

7275825