Regional Information Center for Science and Technology - Towards Corpus-Based Stemming for Arabic Texts

Title of article :

Towards Corpus-Based Stemming for Arabic Texts

Author/Authors :

Sabtan, Yasser Muhammad Naguib - Dhofar University

Pages :

From page :

119

To page :

129

Abstract :

Stemming, the process of reducing words to their stems, is an essential processing step in a number of natural language processing (NLP) applications such as information extraction, text analysis and machine translation. This paper presents a light stemmer for Arabic that uses a corpus-based approach. The stemmer groups morphological variants of words in an Arabic corpus based on shared characters, then strips off their affixes (prefixes and suffixes) to produce their common stem. Experimental results show that 86% of the words in the test set were correctly grouped under a similar reduced form (i.e. the possible stem). In some cases, however, the reduced form is not the legitimate stem: the evaluation shows that 72.2% of the words in the test set were reduced to their legitimate stem. The current stemmer was developed with the future aim of investigating the effectiveness of using word stems for extracting bilingual equivalents from an Arabic-English parallel corpus.
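The general idea of light stemming described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the affix lists, the minimum stem length, and the function names `strip_affixes` and `group_by_stem` are all assumptions made for illustration, and the sketch simplifies the procedure by stripping affixes first and grouping words by the resulting form, whereas the abstract describes grouping by shared characters before stripping.

```python
# Hypothetical affix lists, chosen only for illustration; the abstract does
# not give the lists the paper actually uses.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ب", "ل"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ة", "ي"]
MIN_STEM_LEN = 3  # assumed lower bound so short words are not over-stripped


def strip_affixes(word):
    """Remove at most one prefix and one suffix, longest match first."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def group_by_stem(words):
    """Map each reduced form to the corpus words that share it."""
    groups = {}
    for word in words:
        groups.setdefault(strip_affixes(word), []).append(word)
    return groups


if __name__ == "__main__":
    corpus = ["الكتاب", "وكتاب", "كتابان", "معلمون", "المعلمين"]
    for stem, variants in group_by_stem(corpus).items():
        print(stem, variants)
```

Because the stripping is purely string-based, a reduced form produced this way is only a possible stem. That matches the distinction the abstract draws between words grouped under a similar reduced form and words reduced to their legitimate stem.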

Keywords :

Arabic , Stemming , Corpus-based Approach , Natural Language Processing (NLP) , Affixes , Arabic Corpora , Light Stemmer

Journal title :

International Journal of Linguistics, Literature and Translation

Serial Year :

2018

Record number :

2471897

Link To Document :

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=2471897