Title :
NLTK tagger for Albanian using iterative approach
Author_Institution :
South East Eur. Univ., Tetove, Macedonia
Abstract :
This paper presents research on a tagging model for Albanian texts built with the NLTK toolkit. The model cascades three taggers with backoff. We use a dictionary of around 32,000 words together with their corresponding POS tags, as well as a set of regular-expression rules. A lemmatization module is implemented to convert nouns and verbs to their lemmas. The text is first tagged with a unigram tagger based on the dictionary, which serves as the baseline tagger for a regular-expression tagger. Incorrectly lemmatized words are then corrected, creating a third, lookup tagger that uses the first and second taggers as backoff.
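The cascade described in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the class names, the toy word lists, the tag labels, the regular-expression rules, and the ordering of the chain (corrections first, then the dictionary, then the rules) are all assumptions made for illustration. In NLTK itself, the same chaining is available through the `backoff` parameter of classes such as `UnigramTagger` and `RegexpTagger`.

```python
import re


class BackoffTagger:
    """Base class: try to tag a word, otherwise defer to the backoff tagger."""

    def __init__(self, backoff=None):
        self.backoff = backoff

    def choose_tag(self, word):
        # Subclasses return a tag string, or None when they cannot decide.
        raise NotImplementedError

    def tag_word(self, word):
        tag = self.choose_tag(word)
        if tag is None and self.backoff is not None:
            return self.backoff.tag_word(word)
        return tag

    def tag(self, words):
        return [(word, self.tag_word(word)) for word in words]


class LookupTagger(BackoffTagger):
    """Tags a word by exact (lowercased) lookup in a word -> tag table."""

    def __init__(self, table, backoff=None):
        super().__init__(backoff)
        self.table = table

    def choose_tag(self, word):
        return self.table.get(word.lower())


class RegexTagger(BackoffTagger):
    """Tags a word with the tag of the first matching pattern."""

    def __init__(self, rules, backoff=None):
        super().__init__(backoff)
        self.rules = [(re.compile(pattern), tag) for pattern, tag in rules]

    def choose_tag(self, word):
        for pattern, tag in self.rules:
            if pattern.search(word.lower()):
                return tag
        return None


# Toy data, purely illustrative (the paper's dictionary has ~32,000 entries).
dictionary = {"libër": "N", "dhe": "CONJ", "lexoj": "V"}
rules = [(r"^\d+$", "NUM"), (r"im$", "N")]  # hypothetical rules
corrections = {"jam": "V"}                  # hypothetical correction list

# Assumed ordering: corrections -> dictionary -> regular expressions.
regex_tagger = RegexTagger(rules)
dictionary_tagger = LookupTagger(dictionary, backoff=regex_tagger)
tagger = LookupTagger(corrections, backoff=dictionary_tagger)

tagged = tagger.tag(["Libër", "dhe", "mësim", "2013", "jam", "xyz"])
```

Each tagger answers only when it has an entry or a matching rule, so a word falls through the chain until some tagger claims it; a word that no tagger recognises (here `xyz`) is left with the tag `None`.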
Keywords :
dictionaries; iterative methods; natural language processing; text analysis; Albanian language; Albanian text; NLTK tagger; NLTK toolkit; POS tags; dictionary; iterative approach; lemmatize module; lemmatized words; lookup tagger; nouns; regular expressions rules; regular expressions tagger; taggers cascading; tagging model; text tagging; unigram tagger; verbs; Accuracy; Dictionaries; Economics; Hidden Markov models; Mood; Tagging; Training; NLTK; POS tagging
Conference_Title :
Information Technology Interfaces (ITI), Proceedings of the ITI 2013 35th International Conference on
Conference_Location :
Cavtat
Print_ISBN :
978-953-7138-30-1
DOI :
10.2498/iti.2013.0565