Title :
Feature extraction for co-occurrence-based cosine similarity score of text documents
Author :
Kadhim, Ammar Ismael ; Cheah, Yu-N. ; Ahamed, Nurul Hashimah ; Salman, Lubab A.
Author_Institution :
Sch. of Comput. Sci., Univ. Sains Malaysia, Minden, Malaysia
Abstract :
A major challenge in topic classification (TC) is the high dimensionality of the feature space. Feature extraction (FE) therefore plays a vital role in topic classification in particular and in text mining in general. FE based on a cosine similarity score is commonly used to reduce the dimensionality of datasets with tens or hundreds of thousands of features, which can otherwise be impossible to process further. In this study, TF-IDF term weighting is used to extract features. Selecting relevant features and determining how to encode them for a machine learning method have a large impact on that method's ability to extract a good model. Two different weighting methods (TF-IDF and TF-IDF Global) were used and tested on the Reuters-21578 text categorization test collection. The results suggest that the approach is a good candidate for improving the performance of FE for English topics. Simulation results on the Reuters-21578 text categorization test collection show the superiority of the proposed algorithm.
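The abstract names TF-IDF term weighting and a cosine similarity score but gives no formulas, so the following is only a minimal sketch of those two ideas under stated assumptions. It uses the textbook weighting tf * log(N / df), which may differ from the paper's TF-IDF and TF-IDF Global variants; the function names `tfidf_vectors` and `cosine_similarity` and the three toy documents are hypothetical and are not taken from the paper or from Reuters-21578.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document.

    Assumption: weight = (term count / document length) * log(N / df).
    The paper's exact weighting scheme is not given in the abstract.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already lowercased and tokenized.
docs = [
    "oil prices rise as crude supply falls".split(),
    "crude oil supply falls and prices rise".split(),
    "wheat harvest grows in the north".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])    # documents sharing many terms
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # documents sharing no terms
```

In this sketch the two oil-related documents receive a positive similarity, while the pair with no shared vocabulary scores zero. Thresholding on such scores is one plausible way to use them for dimensionality reduction, though the abstract does not state how the paper applies them.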
Keywords :
data mining; feature extraction; feature selection; learning (artificial intelligence); pattern classification; text analysis; FE; Reuters-21578 text categorization test collection; TC; TF-IDF global method; TF-IDF term weighting; co-occurrence-based cosine similarity score; dataset dimensionality reduction; feature space high dimensionality; learning machine method; text documents; text mining; topic classification; Indexing; Iron; Measurement; Text categorization; Vectors; Vocabulary; TF-IDF weighting; cosine similarity score
Conference_Title :
Research and Development (SCOReD), 2014 IEEE Student Conference on
Print_ISBN :
978-1-4799-6427-7
DOI :
10.1109/SCORED.2014.7072954