DocumentCode :
3722614
Title :
An Improved K-means Algorithm for Document Clustering
Author :
Guohua Wu;Hairong Lin;Ershuai Fu;Liuyang Wang
Author_Institution :
Sch. of Comput. Sci. &
fYear :
2015
Firstpage :
65
Lastpage :
69
Abstract :
The K-means algorithm performs poorly on high-dimensional, sparse data, because traditional distance measures cannot handle such data effectively. Motivated by this, the paper proposes a K-means algorithm based on SimHash. After the text is preprocessed, SimHash is applied to the extracted feature vectors to obtain a fingerprint for each text. SimHash not only reduces the dimensionality of the text representation, but also allows the Hamming distance between fingerprints to be used directly as the vector distance. The Hamming distance then determines which cluster each data point belongs to. Experimental results show that the algorithm preserves clustering quality while greatly reducing the running time of K-means clustering.
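The fingerprint-and-assign idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 64-bit fingerprint width, the MD5-based token hash, the term-frequency weights, and the helper names (token_hash, simhash, hamming, nearest) are all assumptions, since the abstract does not specify them. The center-update step of K-means is left out because the abstract does not describe it.

```python
import hashlib
from collections import Counter

BITS = 64  # assumed fingerprint width; not stated in the abstract


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens) -> int:
    """SimHash fingerprint: each token votes on every bit with its weight.

    Term frequency is used as the weight here (an assumption).
    """
    votes = [0] * BITS
    for token, weight in Counter(tokens).items():
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += weight if (h >> i) & 1 else -weight
    # A bit is set in the fingerprint when its weighted vote is positive.
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def nearest(fingerprint: int, centers) -> int:
    """Index of the cluster center closest in Hamming distance."""
    return min(range(len(centers)), key=lambda k: hamming(fingerprint, centers[k]))


if __name__ == "__main__":
    docs = [
        "k means clustering of text documents".split(),
        "k means clustering of text files".split(),
        "weather forecast for the weekend".split(),
    ]
    fingerprints = [simhash(d) for d in docs]
    centers = [fingerprints[0], fingerprints[2]]  # hypothetical initial centers
    print([nearest(fp, centers) for fp in fingerprints])
```

Each document is reduced to a single integer, so the distance computation in the assignment step becomes an XOR followed by a bit count rather than an operation over a long sparse vector.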
Keywords :
"Clustering algorithms","Algorithm design and analysis","Fingerprint recognition","Hamming distance","Classification algorithms","Computer science","Feature extraction"
Publisher :
IEEE
Conference_Title :
Computer Science and Mechanical Automation (CSMA), 2015 International Conference on
Type :
conf
DOI :
10.1109/CSMA.2015.20
Filename :
7371624