DocumentCode
1277750
Title
A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification
Author
Jiang, Jung-Yi ; Liou, Ren-Jia ; Lee, Shie-Jue
Author_Institution
Dept. of Electr. Eng., Nat. Sun Yat-Sen Univ., Kaohsiung, Taiwan
Volume
23
Issue
3
fYear
2011
fDate
3/1/2011 12:00:00 AM
Firstpage
335
Lastpage
349
Abstract
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test, so that words similar to each other fall into the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. Once all the words have been fed in, a desired number of clusters has been formed automatically. We then have one extracted feature for each cluster: a weighted combination of the words contained in that cluster. With this algorithm, the derived membership functions closely match and properly describe the real distribution of the training data. Moreover, the user need not specify the number of extracted features in advance, which avoids trial and error in determining an appropriate number. Experimental results show that our method runs faster and obtains better extracted features than other methods.
Keywords
fuzzy set theory; pattern clustering; statistical analysis; text analysis; derived membership functions; deviation; fuzzy self-constructing feature clustering algorithm; statistical mean; text classification; Clustering algorithms; Clustering methods; Complexity theory; Feature extraction; Support vector machines; Training; Training data; Fuzzy similarity; feature clustering; feature extraction; feature reduction; text classification
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2010.122
Filename
5530315