Regional Information Center for Science and Technology - An Improved K-means Algorithm for Document Clustering

DocumentCode :

3722614

Title :

An Improved K-means Algorithm for Document Clustering

Author :

Guohua Wu;Hairong Lin;Ershuai Fu;Liuyang Wang

Author_Institution :

Sch. of Comput. Sci. &

Year :

2015

Firstpage :

Lastpage :

Abstract :

The K-Means algorithm performs poorly on high-dimensional, sparse data, because traditional distance measures cannot handle such data effectively. Motivated by this, this paper proposes a K-Means algorithm based on SimHash. After the text is preprocessed, SimHash is applied to the extracted feature vectors to obtain a fingerprint for each text. SimHash not only reduces the dimensionality of the text representation, but also allows the Hamming distance between fingerprints to be used directly as the vector distance. The Hamming distance then determines which cluster each data item belongs to. Experimental results show that the algorithm maintains clustering quality while greatly reducing the running time of K-means clustering.
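
The pipeline outlined in the abstract (fingerprint each text with SimHash, then cluster by Hamming distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The record does not specify the fingerprint width, the feature weighting, the centroid update, or the initialisation, so the 64-bit width, MD5-based token hash, term-frequency weights, bitwise-majority centroid, and first-k initialisation below are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width; an assumed value, not taken from the paper


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(text: str) -> int:
    """SimHash fingerprint, using term frequency as the (assumed) feature weight."""
    weights = Counter(text.lower().split())
    totals = [0] * BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            # Add the weight where the token's hash bit is 1, subtract it otherwise.
            totals[i] += weight if (h >> i) & 1 else -weight
    # A fingerprint bit is set when the weighted vote for that position is positive.
    return sum(1 << i for i in range(BITS) if totals[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def majority(fingerprints):
    """Bitwise majority vote, an assumed stand-in for the cluster centroid."""
    half = len(fingerprints) / 2
    return sum(
        1 << i
        for i in range(BITS)
        if sum((f >> i) & 1 for f in fingerprints) > half
    )


def kmeans_hamming(fingerprints, k, iterations=10):
    """K-means-style loop over fingerprints with Hamming distance."""
    centroids = list(fingerprints[:k])  # naive initialisation: first k items
    labels = []
    for _ in range(iterations):
        # Assignment step: each fingerprint joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: hamming(f, centroids[c]))
            for f in fingerprints
        ]
        # Update step: recompute each non-empty cluster's centroid.
        for c in range(k):
            members = [f for f, lab in zip(fingerprints, labels) if lab == c]
            if members:
                centroids[c] = majority(members)
    return labels
```

Because each document is reduced to a single integer, a distance computation is one XOR plus a bit count rather than an operation over a sparse high-dimensional vector, which is the source of the speed-up the abstract reports.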

Keywords :

"Clustering algorithms","Algorithm design and analysis","Fingerprint recognition","Hamming distance","Classification algorithms","Computer science","Feature extraction"

Publisher :

IEEE

Conference_Title :

Computer Science and Mechanical Automation (CSMA), 2015 International Conference on

Type :

conf

DOI :

10.1109/CSMA.2015.20

Filename :

7371624

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3722614