Title :
An improved root extraction technique for Arabic words
Author :
Al-Nashashibi, May Y. ; Neagu, D. ; Yaghi, Ali A.
Author_Institution :
Dept. of Comput., Univ. of Bradford, Bradford, UK
Abstract :
Interpreting Arabic text depends, among other things, on a pre-processing stage that effectively extracts the correct stem or root of each word. This work addresses a linguistic approach to root extraction as a pre-processing step for Arabic text mining. The linguistic approach consists of a rule-based light stemmer and a pattern-based infix remover. Because this approach does not handle weak, eliminated-long-vowel, hamzated, and geminated words, and a reasonably large portion of the Arabic words in texts are irregular, we propose a correction algorithm for such cases. The accuracy of the extracted roots is determined by comparing them with a predefined list of 5,405 triliteral and quadriliteral roots. The performance of the linguistic approach, with and without the proposed correction algorithm, was tested on an in-house text collection covering eight categories. The proposed correction algorithm improved the accuracy of the linguistic approach by about 14%.
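The pipeline named in the abstract (light stemming, pattern-based infix removal, then validation against a root list) can be sketched roughly as below. This is only an illustration under stated assumptions: the affix lists, the patterns, the four-entry root set, and the function names `light_stem`, `remove_infixes`, and `extract_root` are all hypothetical and are not taken from the paper, whose actual rules and correction algorithm are not given in this record.

```python
# Hypothetical, tiny affix lists; a real light stemmer uses far larger rule sets.
PREFIXES = ["وال", "بال", "ال", "و"]  # ordered longest first
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]

# Patterns use the letters ف ع ل as placeholders for the three root letters;
# every other letter in a pattern is a literal affix or infix.
PATTERNS = ["فاعل", "مفعول", "فعال", "فعيل", "مفعل"]

# Stand-in for the predefined list of 5,405 roots mentioned in the abstract.
KNOWN_ROOTS = {"كتب", "درس", "علم", "قول"}


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def remove_infixes(stem):
    """Return the letters in root positions of the first matching pattern, else None."""
    if len(stem) == 3:
        return stem  # already three letters; treat as the candidate root
    for pattern in PATTERNS:
        if len(pattern) != len(stem):
            continue
        root = []
        for p_ch, s_ch in zip(pattern, stem):
            if p_ch in "فعل":
                root.append(s_ch)  # placeholder position: keep the stem letter
            elif p_ch != s_ch:
                break              # literal position does not match this pattern
        else:
            return "".join(root)
    return None


def extract_root(word):
    """Accept a candidate root only if it appears in the known-root list."""
    candidate = remove_infixes(light_stem(word))
    return candidate if candidate in KNOWN_ROOTS else None
```

With these toy lists, `extract_root("الكاتب")` and `extract_root("مكتوب")` both return `"كتب"`, while the weak verb `"قال"` (root `"قول"`) returns `None`, which illustrates the kind of irregular case that a separate correction step would have to handle.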
Keywords :
data mining; natural language processing; text analysis; Arabic text interpretation; Arabic text mining; Arabic words; improved root extraction technique; linguistic approach; pattern-based infix remover; rule-based light stemmer; Pragmatics; Weaving; Arabic Root Extraction; Natural Language Processing; Rule-Based Stemming; Text Mining;
Conference_Title :
Computer Technology and Development (ICCTD), 2010 2nd International Conference on
Conference_Location :
Cairo
Print_ISBN :
978-1-4244-8844-5
Electronic_ISBN :
978-1-4244-8845-2
DOI :
10.1109/ICCTD.2010.5645872