Regional Information Center for Science and Technology - An improved root extraction technique for Arabic words

DocumentCode :

3238826

Title :

An improved root extraction technique for Arabic words

Author :

Al-Nashashibi, May Y. ; Neagu, D. ; Yaghi, Ali A.

Author_Institution :

Dept. of Comput., Univ. of Bradford, Bradford, UK

fYear :

2010

fDate :

2-4 Nov. 2010

Firstpage :

264

Lastpage :

269

Abstract :

Interpreting Arabic text depends, among other things, on a pre-processing stage that effectively extracts a correct stem or root. This work presents a linguistic approach to root extraction as a pre-processing step for Arabic text mining. The linguistic approach combines a rule-based light stemmer with a pattern-based infix remover. Because this approach does not handle weak, eliminated-long-vowel, hamzated, and geminated words, and a reasonably large portion of Arabic words in texts are irregular, we propose a correction algorithm for these cases. The accuracy of the extracted roots is determined by comparing them with a predefined list of 5,405 triliteral and quadriliteral roots. The performance of the linguistic approach, with and without the proposed correction algorithm, was tested on an in-house text collection of eight categories. The proposed correction algorithm improved the accuracy of the linguistic approach by about 14%.
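
The pipeline the abstract describes (light stemming, pattern-based infix removal, then validation against a root list) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the affix lists, the patterns, the three-entry root set, and the function names `light_stem`, `match_pattern`, and `extract_root` are hypothetical and are not taken from the paper. The sketch also omits the proposed correction step for weak, hamzated, and geminated words.

```python
# Illustrative sketch only: the affix lists, patterns, and root set below are
# small hypothetical samples, not the resources used in the paper.
PREFIXES = ["ال", "و", "ب"]    # hypothetical sample of prefixes
SUFFIXES = ["ون", "ات", "ة"]   # hypothetical sample of suffixes
# In each pattern, the letters ف ع ل are placeholders for the root letters;
# every other letter is an infix or affix that must match literally.
PATTERNS = ["فاعل", "مفعول", "فعيل", "فعل"]
KNOWN_ROOTS = {"كتب", "درس", "علم"}  # stand-in for the 5,405-root list


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def match_pattern(stem, pattern):
    """Return the letters aligned with ف, ع, ل if stem fits pattern, else None."""
    if len(stem) != len(pattern):
        return None
    slots = {}
    for stem_char, pattern_char in zip(stem, pattern):
        if pattern_char in "فعل":
            slots[pattern_char] = stem_char   # root-letter position
        elif stem_char != pattern_char:
            return None                       # literal infix does not match
    return slots["ف"] + slots["ع"] + slots["ل"]


def extract_root(word):
    """Light-stem the word, then accept the first pattern match found in KNOWN_ROOTS."""
    stem = light_stem(word)
    for pattern in PATTERNS:
        root = match_pattern(stem, pattern)
        if root in KNOWN_ROOTS:
            return root
    return None
```

With these sample tables, `extract_root("الكاتب")` strips the prefix and matches the stem against the pattern فاعل. A word that fits no pattern, or whose candidate root is not in the list, yields `None`, which is roughly where a correction step for irregular words would come in.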

Keywords :

data mining; natural language processing; text analysis; Arabic text interpretation; Arabic text mining; Arabic words; improved root extraction technique; linguistic approach; pattern-based infix remover; rule-based light stemmer; Pragmatics; Weaving; Arabic Root Extraction; Natural Language Processing; Rule-Based Stemming; Text Mining;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computer Technology and Development (ICCTD), 2010 2nd International Conference on

Conference_Location :

Cairo

Print_ISBN :

978-1-4244-8844-5

Electronic_ISBN :

978-1-4244-8845-2

Type :

conf

DOI :

10.1109/ICCTD.2010.5645872

Filename :

5645872

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3238826