Title of article :
Towards Corpus-Based Stemming for Arabic Texts
Author/Authors :
Sabtan, Yasser Muhammad Naguib - Dhofar University
Pages :
11
From page :
119
To page :
129
Abstract :
Stemming, the process of reducing words to their stems, is an essential processing step in a number of natural language processing (NLP) applications such as information extraction, text analysis and machine translation. This paper presents a light stemmer for Arabic that uses a corpus-based approach. The stemmer first groups morphological variants of words in an Arabic corpus based on shared characters, then strips off their affixes (prefixes and suffixes) to produce their common stem. Experimental results show that 86% of the words in the test set were correctly grouped under a similar reduced form (i.e. the possible stem). In some cases the reduced form is not the legitimate stem: the evaluation shows that 72.2% of the words in the test set were reduced to their legitimate stem. The current stemmer is developed with the future aim of investigating the effectiveness of using word stems for extracting bilingual equivalents from an Arabic-English parallel corpus.
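The two steps named in the abstract (grouping variants by shared characters, then stripping affixes) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the affix lists, the three-character thresholds, the greedy grouping rule and all function names are assumptions made purely for illustration, since the abstract does not specify them.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical affix lists (assumption: the abstract does not give the real ones).
PREFIXES = ["وال", "بال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]


def shared_length(a, b):
    """Length of the longest run of characters that a and b have in common."""
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size


def group_variants(words, min_shared=3):
    """Step 1: greedily put each word into the first group whose first member
    shares at least min_shared contiguous characters with it."""
    groups = []
    for word in words:
        for group in groups:
            if shared_length(word, group[0]) >= min_shared:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


def strip_affixes(word, min_stem=3):
    """Step 2: remove at most one prefix and one suffix (longest first),
    never leaving fewer than min_stem characters."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


def stem_corpus(words):
    """Map each group's most frequent reduced form to the words in that group."""
    stems = {}
    for group in group_variants(words):
        reduced = Counter(strip_affixes(w) for w in group)
        stem = reduced.most_common(1)[0][0]
        stems.setdefault(stem, []).extend(group)
    return stems


corpus = ["كتاب", "الكتاب", "كتابها", "معلم", "المعلم", "معلمون"]
for stem, variants in stem_corpus(corpus).items():
    print(stem, variants)
```

In this toy corpus the variants of كتاب ("book") and معلم ("teacher") fall into two separate groups, each labelled with its reduced form. As the abstract notes for the real stemmer, a reduced form obtained this way is only a possible stem and need not be the legitimate one.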
Keywords :
Arabic, Stemming, Corpus-based Approach, Natural Language Processing (NLP), Affixes, Arabic Corpora, Light Stemmer
Journal title :
International Journal of Linguistics, Literature and Translation
Serial Year :
2018
Record number :
2471897