Title of article :
A novel approach to the extraction of roots from Arabic words using bigrams
Author/Authors :
Ismail I. Hmeidi, author
Riyad F. Al-Shalabi, author
Ahmad T. Al-Taani, author
Hassan Najadat, author
Shaker A. Al-Hazaimeh, author
Issue Information :
Monthly, serial issue, 2010
Abstract :
Root extraction is an important task in information retrieval (IR), natural language processing (NLP), text summarization, and many other fields. Over the last two decades, several algorithms have been proposed to extract Arabic roots. Most of these algorithms handle triliteral roots only, and some handle fixed-length words only. This study proposes a novel approach to extracting roots from Arabic words using bigrams. Two measures are used: a dissimilarity measure, the Manhattan distance, and a similarity measure, Dice's coefficient. The proposed algorithm is tested on the Holy Qur'an and on a corpus of 242 abstracts from the Proceedings of the Saudi Arabian National Computer Conferences. The two files cover a wide range of data: the Holy Qur'an contains most of the ancient Arabic words, while the abstracts contain modern Arabic words and words borrowed from foreign languages in addition to original Arabic words. The results show that combining N-grams with the Dice measure gives better results than combining them with the Manhattan distance.
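The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea: split a word and each candidate root into bigrams, then pick the root with the highest Dice similarity or the lowest Manhattan distance. The use of contiguous character bigrams, the function names, and the three-entry root list are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def bigrams(word):
    """Return the contiguous character bigrams of a word (an assumed bigram definition)."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice_similarity(a, b):
    """Dice coefficient over bigram sets: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def manhattan_distance(a, b):
    """Manhattan (L1) distance between the bigram count vectors of two strings."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    return sum(abs(ca[g] - cb[g]) for g in set(ca) | set(cb))


def best_root(word, roots, use_dice=True):
    """Pick the candidate root closest to the word under the chosen measure."""
    if use_dice:
        return max(roots, key=lambda r: dice_similarity(word, r))
    return min(roots, key=lambda r: manhattan_distance(word, r))


# Hypothetical candidate root list, for illustration only.
candidate_roots = ["كتب", "درس", "علم"]

print(best_root("مكتوب", candidate_roots))                  # Dice similarity
print(best_root("مكتوب", candidate_roots, use_dice=False))  # Manhattan distance
```

On this toy list both measures select the same root, since only one candidate shares a bigram with the word. The difference reported in the abstract would only show up on a realistic root lexicon.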
Journal title :
Journal of the American Society for Information Science and Technology