Title :
A novel weighting scheme for efficient document indexing and classification
Author :
Tahayna, Bashar ; Ayyasamy, Ramesh Kumar ; Alhashmi, Saadat ; Eu-Gene, Siew
Author_Institution :
Sch. of IT, Monash Univ., Bandar Sunway, Malaysia
Abstract :
In this paper we propose a new topic-based document classification method and illustrate its effectiveness. The proposed method utilizes Wikipedia, a large-scale Web encyclopaedia with a large number of high-quality articles and a category system. Using an N-gram technique, the document text is matched against Wikipedia to transform the document from a “bag of words” into a “bag of concepts”. Based on this transformation, a novel concept-based weighting scheme (denoted Conf.idf) is proposed to index the text in the spirit of the traditional tf.idf indexing scheme. Moreover, a genetic algorithm-based support vector machine optimization method is used for feature subset and instance selection. Experimental results show that the proposed weighting scheme outperforms the traditional indexing and weighting scheme.
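The abstract does not give the Conf.idf formula, so the following is only a minimal sketch of the general idea, under stated assumptions: word n-grams are greedily matched against a small hypothetical concept table (`CONCEPTS`, standing in for Wikipedia article titles), and each concept is weighted by its relative frequency in the document times a log inverse document frequency, by analogy with tf.idf. The function names and the weighting formula are illustrative assumptions, not the authors' definition.

```python
import math
from collections import Counter

# Hypothetical stand-in for a Wikipedia title index (assumption): maps a
# lowercase n-gram to the concept (article title) it refers to.
CONCEPTS = {
    "support vector machine": "Support vector machine",
    "svm": "Support vector machine",
    "genetic algorithm": "Genetic algorithm",
}


def bag_of_concepts(text, max_n=3):
    """Turn a text into concept counts by greedy longest n-gram matching."""
    words = text.lower().split()
    concepts = Counter()
    i = 0
    while i < len(words):
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_n, len(words) - i), 0, -1):
            gram = " ".join(words[i:i + n])
            if gram in CONCEPTS:
                concepts[CONCEPTS[gram]] += 1
                i += n
                break
        else:
            i += 1  # no concept starts at this word; move on
    return concepts


def concept_idf_weights(docs):
    """Weight concepts tf.idf-style: relative concept frequency * log(N / df).

    This formula is an assumed analogue of tf.idf, not the paper's Conf.idf.
    """
    bags = [bag_of_concepts(d) for d in docs]
    # Document frequency: number of documents containing each concept.
    df = Counter(c for bag in bags for c in bag)
    n_docs = len(docs)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append(
            {c: (cnt / total) * math.log(n_docs / df[c]) for c, cnt in bag.items()}
        )
    return weights


docs = [
    "a genetic algorithm tunes the svm",
    "the support vector machine classifier",
    "plain words only",
]
weights = concept_idf_weights(docs)
```

In this toy corpus, "svm" and "support vector machine" collapse into one concept, which is the point of moving from words to concepts; the rarer concept in the first document receives the larger weight.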
Keywords :
genetic algorithms; indexing; pattern classification; support vector machines; text analysis; N-gram technique; Wikipedia; category system; concept-based weighting scheme; document indexing; genetic algorithm-based support vector machine optimization method; large scale Web encyclopaedia; text indexing; topic-based document classification method; Classification algorithms; Kernel; feature subset selection; term weighting scheme;
Conference_Title :
Information Technology (ITSim), 2010 International Symposium in
Conference_Location :
Kuala Lumpur
Print_ISBN :
978-1-4244-6715-0
DOI :
10.1109/ITSIM.2010.5561553