Title :
An improved K-means algorithm using modified cosine distance measure for document clustering using Mahout with Hadoop
Author :
Sahu, Lokesh ; Mohan, Biju R.
Author_Institution :
Dept. of Inf. Technol., Nat. Inst. of Technol. Karnataka, Surathkal, India
Abstract :
In this paper, we propose a novel K-means algorithm with a modified cosine distance measure for clustering large datasets such as the latest Wikipedia articles and the Reuters dataset. We customize the cosine distance measure used to compute similarity between objects in order to improve cluster quality. Our method computes the cosine distance between two objects and then brings them closer by squaring the distance if it lies between 0 and 0.5; otherwise it increases the distance. This minimizes the intra-cluster distance and maximizes the inter-cluster distance. We measure cluster quality in terms of inter-cluster and intra-cluster distances, good feature weighting such as TF-IDF, cluster size, and the top terms of the clusters. We compare the K-means algorithm under the cosine and modified cosine distance measures using performance metrics such as inter-cluster and intra-cluster distances, cluster size, and execution time. Our experimental results show a reduction in intra-cluster distance of 0.016%, an increase in inter-cluster distance of 0.012%, a reduction in cluster size of 1.5%, and a reduction in sequence file size of 4%, which results in good cluster quality.
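The distance adjustment described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Mahout implementation. The abstract does not say how distances above 0.5 are increased, so the square root used here is purely an assumption, as are the function names, the inclusive 0.5 boundary, and the handling of zero vectors. The sketch also assumes non-negative feature vectors (such as TF-IDF weights), for which the cosine distance stays within 0 to 1.

```python
import math


def cosine_distance(a, b):
    """Plain cosine distance, 1 - cos(a, b), for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def modified_cosine_distance(a, b, threshold=0.5):
    """Shrink small distances by squaring; enlarge the rest.

    The squaring rule for distances between 0 and 0.5 comes from the
    abstract. The square root in the other branch is a hypothetical
    stand-in for the unspecified "increase" step.
    """
    d = max(cosine_distance(a, b), 0.0)  # clamp floating-point noise below 0
    if d <= threshold:
        return d * d          # d^2 <= d for d in [0, 1], so the pair moves closer
    return math.sqrt(d)       # ASSUMPTION: sqrt(d) >= d for d in [0, 1]


if __name__ == "__main__":
    near = ([1.0, 0.0], [1.0, 1.0])   # plain distance below 0.5
    far = ([1.0, 0.0], [1.0, 3.0])    # plain distance above 0.5
    for u, v in (near, far):
        print(cosine_distance(u, v), modified_cosine_distance(u, v))
```

Under these assumptions, pairs that are already similar are pulled closer together and dissimilar pairs are pushed further apart, which is the effect the abstract attributes to the modified measure.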
Keywords :
document handling; parallel processing; pattern clustering; Hadoop; Mahout; Reuters dataset; TF-IDF; Wikipedia; cluster quality improvement; cluster quality measurement; cluster size reduction; document clustering; execution time; feature weighting; intercluster distance value maximization; k-means algorithm; minimum intracluster distance value; modified cosine distance measure; object similarity analysis; performance metric; sequence file size reduction; Algorithm design and analysis; Clustering algorithms; Encyclopedias; Internet; Size measurement; Time measurement; Vectors; K-means
Conference_Title :
Industrial and Information Systems (ICIIS), 2014 9th International Conference on
Conference_Location :
Gwalior
Print_ISBN :
978-1-4799-6499-4
DOI :
10.1109/ICIINFS.2014.7036661