DocumentCode
423337
Title
A feature selection algorithm for document clustering based on word co-occurrence frequency
Author
Liu, Yuan-Chao ; Wang, Xiao-long ; Liu, Bing-quan
Author_Institution
Sch. of Comput. Sci. & Technol., Harbin Inst. of Technol., China
Volume
5
fYear
2004
fDate
26-29 Aug. 2004
Firstpage
2963
Abstract
Constructing the feature space from only the more informative words can greatly speed up a document clustering algorithm without degrading cluster quality. This paper first discusses the impact of feature selection on document clustering and then proposes a new feature selection method based on word co-occurrence frequency. According to the cluster hypothesis, documents from the same class are more similar to each other when represented in the vector space model (VSM), so many of the words in these documents tend to occur together. We find these words through word co-occurrence and then construct a reduced feature space for clustering. Experiments show that the selected features are more salient. When documents are clustered in the reduced feature space, run time is shortened greatly while cluster quality is almost unchanged, which makes the clustering algorithm more suitable for practical use.
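The abstract does not give the exact selection rule, so the following is only a minimal sketch of the general idea. It assumes that a word is kept when it co-occurs with some other word in at least `min_pair_count` documents. The function names, the threshold, and the toy documents are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def select_features_by_cooccurrence(docs, min_pair_count=2):
    """Keep words that share a document with another word often enough.

    Assumption (not from the paper): a word pair is "frequent" when both
    words appear together in at least min_pair_count documents.
    """
    pair_counts = Counter()
    for doc in docs:
        # Count each unordered word pair at most once per document.
        vocab = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(vocab, 2))

    selected = set()
    for (w1, w2), count in pair_counts.items():
        if count >= min_pair_count:
            selected.update((w1, w2))
    return selected


def to_reduced_vectors(docs, features):
    """Represent each document as term counts over the reduced feature space."""
    ordered = sorted(features)
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([counts[w] for w in ordered])
    return ordered, vectors


# Hypothetical toy corpus with two obvious topics.
docs = [
    "stock market price rises",
    "stock market price falls",
    "football match score",
    "football match result",
]
features = select_features_by_cooccurrence(docs, min_pair_count=2)
ordered, vectors = to_reduced_vectors(docs, features)
```

In this toy corpus the words that appear in only one document ("rises", "falls", "score", "result") are dropped, so each document vector shrinks from nine dimensions to five before any clustering algorithm is run on it.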
Keywords
feature extraction; pattern clustering; text analysis; vectors; cluster hypothesis; document clustering algorithm; feature selection algorithm; feature space construction; vector space model; word cooccurrence frequency; Clustering algorithms; Computer science; Explosives; Frequency; Internet; Navigation; Partitioning algorithms; Search engines; Space technology; Unsupervised learning
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on
Print_ISBN
0-7803-8403-2
Type
conf
DOI
10.1109/ICMLC.2004.1378540
Filename
1378540