Title of article :
PHMM: Stemming on Persian Texts using Statistical Stemmer Based on Hidden Markov Model
Author/Authors :
Momenipour, Fatemeh Islamic Azad University, Qazvin Branch - Department of Computer Engineering, Iran , Keyvanpour, Mohammad Reza Alzahra University - Department of Computer Engineering, Iran
Abstract :
Stemming is the process of finding the main morpheme of a word; it is used in natural language processing, text mining, and information retrieval systems. A stemmer extracts the stems of words. Persian stemmers fall into three main classes: structural stemmers, dictionary-based stemmers, and statistical stemmers. The precision of structural stemmers is low and the cost of dictionary-based stemmers is high; therefore, the main goal of this research was to design and implement a high-precision statistical stemmer based on a Hidden Markov Model (HMM), in order to reduce the size of the index file and increase the speed of information retrieval systems. The proposed stemmer finds the prefixes and suffixes of a word and removes them, and the remainder of the word is taken to be the stem. However, some Persian words are exceptions for which this procedure would mistakenly produce a wrong stem. Therefore, a dictionary of Persian stems was first collected; the proposed stemmer looks a word up in this dictionary and, only if the word is not found there, derives its stem with the HMM-based stemmer. The stemmer was tested on the Bijankhan corpus and the Hamshahri test collection. The results showed an increase in mean average precision and recall. The algorithm also increased the speed of the information retrieval system and decreased the size of the index files.
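The dictionary-first control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the HMM itself is not reproduced, and a naive affix stripper (`strip_affixes`) stands in for it. The romanized words, the affix lists, and all function and variable names are hypothetical assumptions introduced only for this example.

```python
from typing import Callable, Dict

# Hypothetical toy data in romanized form; none of it comes from the paper.
# "tanha" (alone) ends in "ha", which is not a plural suffix in this word.
EXCEPTIONS: Dict[str, str] = {"tanha": "tanha"}
PREFIXES = ("mi",)        # assumed verbal prefix
SUFFIXES = ("ha", "am")   # assumed plural and personal endings


def strip_affixes(word: str) -> str:
    """Stand-in for the HMM-based stemmer (an assumption, not the real model).

    Removes at most one known prefix and one known suffix, and always
    leaves at least two characters as the stem.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 1:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            word = word[:-len(suffix)]
            break
    return word


def stem(
    word: str,
    dictionary: Dict[str, str] = EXCEPTIONS,
    fallback: Callable[[str], str] = strip_affixes,
) -> str:
    """Look the word up in the dictionary first; otherwise use the fallback."""
    if word in dictionary:
        return dictionary[word]
    return fallback(word)


if __name__ == "__main__":
    for w in ("ketabha", "miravam", "tanha"):
        print(w, "->", stem(w))
```

In this sketch the dictionary lookup shields the exception "tanha" from the affix stripper, which on its own would reduce it to "tan". A statistical model such as the HMM-based stemmer would take the place of `strip_affixes` as the `fallback` argument.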
Keywords :
Stem, Stemmers, Hidden Markov Model, Persian Words
Journal title :
International Journal of Information Science and Management (IJISM)