DocumentCode :
2707590
Title :
Comparison of different lemmatization approaches for information retrieval on Turkish text collection
Author :
Ozturkmenoglu, Okan ; Alpkocak, Adil
Author_Institution :
Dept. of Comput. Eng., Dokuz Eylul Univ., Izmir, Turkey
fYear :
2012
fDate :
2-4 July 2012
Firstpage :
1
Lastpage :
5
Abstract :
In this paper, we compare the performance of different lemmatization approaches for information retrieval over a Turkish text collection. A lemma is simply the "dictionary form" of a word, and lemmatization is the process of determining the lemma of a given word so that its different inflected forms can be analyzed as a single item. We compared three lemmatizers and one fixed-length truncation approach. The first is based on a morphological analyzer for Turkish that uses finite-state language processing technology; the second is the Dictionary-based Turkish Lemmatizer (DTL), which uses a radix-trie data structure; the third is a simple dictionary-based top-down parser; and the last truncates words at a fixed length. We assessed the lemmatizers on the Bilkent University Milliyet collection, which contains more than 400K documents. Performance was compared within an IR system using well-known IR evaluation metrics. The results show that lemmatization improves IR performance; the best results were achieved with DTL, the radix-trie-based lemmatizer, which also used the smallest number of terms in the IR system.
Keywords :
data structures; finite state machines; information retrieval; natural languages; text analysis; Bilkent University Milliyet collection; DTL radix-trie data structure; IR evaluation metrics; Turkish text collection; dictionary form; dictionary-based Turkish lemmatizer; finite state language processing technology; fixed length truncation approaches; information retrieval; morphological analyzer; simple dictionary-based top-down parser; Data structures; Dictionaries; Educational institutions; Indexing; Pragmatics; Information Retrieval; Lemmatization; Normalization; Turkish Information Retrieval;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Innovations in Intelligent Systems and Applications (INISTA), 2012 International Symposium on
Conference_Location :
Trabzon
Print_ISBN :
978-1-4673-1446-6
Type :
conf
DOI :
10.1109/INISTA.2012.6246934
Filename :
6246934