Title :
GS-Orthogonalization Based "Basis Feature" Selection from Word Co-occurrence Matrix
Author :
Deqing Wang;Hui Zhang;Rui Liu
Author_Institution :
Sch. of Comput. Sci., Beihang Univ., Beijing, China
Abstract :
Feature selection plays an important role in machine learning applications. For text data in particular, the high-dimensional and sparse characteristics affect the performance of feature selection. In this paper, an unsupervised feature selection algorithm based on Random Projection and Gram-Schmidt Orthogonalization (RP-GSO) over the word co-occurrence matrix is proposed. RP-GSO has three advantages: (1) it takes a dense word co-occurrence matrix as input, avoiding the sparseness of the original document-term matrix; (2) it selects "basis features" by the Gram-Schmidt process, guaranteeing the orthogonality of the feature space; and (3) it adopts random projection to speed up the GS process. We conducted extensive experiments on two real-world text corpora and observed that RP-GSO achieves better performance compared with supervised and unsupervised methods in text classification and clustering tasks.
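The three steps named in the abstract (co-occurrence matrix, random projection, Gram-Schmidt selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the next "basis feature" is chosen, so the greedy largest-residual-norm rule, the Gaussian projection matrix, and the names `rp_gso_sketch`, `n_select`, `proj_dim`, and `tol` are all assumptions made for illustration.

```python
import numpy as np


def rp_gso_sketch(doc_term, n_select, proj_dim, seed=0, tol=1e-10):
    """Illustrative only: pick word indices whose projected co-occurrence
    vectors are mutually (near-)orthogonal. Not the paper's exact method."""
    rng = np.random.default_rng(seed)
    X = np.asarray(doc_term, dtype=float)      # shape (n_docs, n_terms)

    # Word co-occurrence matrix: entry (i, j) sums count_i * count_j over docs.
    cooc = X.T @ X                             # shape (n_terms, n_terms)

    # Gaussian random projection of each word's co-occurrence row
    # down to proj_dim dimensions (assumed projection type).
    R = rng.standard_normal((cooc.shape[1], proj_dim)) / np.sqrt(proj_dim)
    residual = cooc @ R                        # shape (n_terms, proj_dim)

    chosen = np.zeros(residual.shape[0], dtype=bool)
    selected = []
    for _ in range(n_select):
        norms = np.linalg.norm(residual, axis=1)
        norms[chosen] = -1.0                   # never pick a word twice
        j = int(np.argmax(norms))              # assumed rule: largest residual
        if norms[j] <= tol:
            break                              # remaining words add no new direction
        q = residual[j] / norms[j]
        # Gram-Schmidt step: remove the component along q from every row.
        residual -= np.outer(residual @ q, q)
        chosen[j] = True
        selected.append(j)
    return selected


# Toy corpus: terms 0 and 1 always occur together, term 2 occurs separately.
toy = np.array([[1, 1, 0],
                [1, 1, 0],
                [0, 0, 1],
                [0, 0, 1]])
picked = rp_gso_sketch(toy, n_select=2, proj_dim=3)
```

In this toy corpus, terms 0 and 1 have identical co-occurrence rows, so once one of them is selected the other has an essentially zero residual and is skipped. That is the redundancy-removal behaviour the orthogonalization step is meant to provide.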
Keywords :
"Sparse matrices","Feature extraction","Training","Clustering algorithms","MATLAB","Computer science","Matrix decomposition"
Conference_Title :
Data Mining (ICDM), 2015 IEEE International Conference on
DOI :
10.1109/ICDM.2015.80