Title :
Compression-based Arabic text classification
Author :
Ta'amneh, Haneen ; Abu Keshek, Ehsan ; Issa, Manar Bani ; Al-Ayyoub, Mahmoud ; Jararweh, Yaser
Author_Institution :
Jordan Univ. of Sci. & Technol., Irbid, Jordan
Abstract :
Text classification (TC) is one of the fundamental problems in text mining. Many works on TC report interesting approaches and excellent results; however, most of them follow a word-based approach to feature extraction. In this work, we are interested in an alternative (byte-based or character-based) approach known as compression-based TC (CTC). CTC has been used for some languages, such as English and Portuguese, and has been shown to have certain advantages and disadvantages compared with word-based approaches. This work applies CTC to the Arabic language to investigate whether these advantages and disadvantages exist for Arabic as well. The results are encouraging, as they show the viability of using CTC for Arabic TC.
Keywords :
classification; data mining; feature extraction; natural language processing; text analysis; Arabic language; CTC; English; Portuguese; compression-based Arabic text classification; compression-based TC; text mining; word-based approach; Accuracy; Compression algorithms; Dictionaries; Niobium; Testing; Training
Conference_Title :
Computer Systems and Applications (AICCSA), 2014 IEEE/ACS 11th International Conference on
DOI :
10.1109/AICCSA.2014.7073253