• DocumentCode
    265183
  • Title

    An improved K-means algorithm using modified cosine distance measure for document clustering using Mahout with Hadoop

  • Author

    Sahu, Lokesh ; Mohan, Biju R.

  • Author_Institution
    Dept. of Inf. Technol., Nat. Inst. of Technol. Karnataka, Surathkal, India
  • fYear
    2014
  • fDate
    15-17 Dec. 2014
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    In this paper, we propose a novel K-means algorithm with a modified cosine distance measure for clustering large datasets such as the latest Wikipedia articles and the Reuters dataset. We customize the cosine distance measure used to compute similarity between objects in order to improve cluster quality. Our method calculates the distance between objects using the cosine distance measure and then brings nearby objects closer by squaring the distance if it lies between 0 and 0.5; otherwise it increases the distance. This minimizes the intra-cluster distance and maximizes the inter-cluster distance. We measure cluster quality in terms of inter-cluster and intra-cluster distances, good feature weighting such as TF-IDF, cluster size, and the top terms of the clusters. We compare K-means using the cosine and the modified cosine distance measures on performance metrics such as inter-cluster and intra-cluster distances, cluster size, and execution time. Our experimental results show a 0.016% reduction in intra-cluster distance, a 0.012% increase in inter-cluster distance, a 1.5% reduction in cluster size, and a 4% reduction in sequence file size, which together indicate good cluster quality.
  • Keywords
    document handling; parallel processing; pattern clustering; Hadoop; Mahout; Reuters dataset; TF-IDF; Wikipedia; cluster quality improvement; cluster quality measurement; cluster size reduction; document clustering; execution time; feature weighting; intercluster distance value maximization; k-means algorithm; minimum intracluster distance value; modified cosine distance measure; object similarity analysis; performance metric; sequence file size reduction; Algorithm design and analysis; Clustering algorithms; Encyclopedias; Internet; Size measurement; Time measurement; Vectors; Document Clustering; Hadoop; K-means; Mahout;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Industrial and Information Systems (ICIIS), 2014 9th International Conference on
  • Conference_Location
    Gwalior
  • Print_ISBN
    978-1-4799-6499-4
  • Type

    conf

  • DOI
    10.1109/ICIINFS.2014.7036661
  • Filename
    7036661
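
The distance adjustment described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' Mahout implementation: the abstract does not say how distances above 0.5 are increased, so the square root used here is an assumption, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Standard cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an empty vector is maximally distant.
        return 1.0
    # Clamp at 0 to absorb tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


def modified_cosine_distance(a, b):
    """Shrink small distances by squaring them; enlarge the rest.

    Squaring for distances in [0, 0.5] follows the abstract. The square
    root for larger distances is an ASSUMPTION standing in for the
    unspecified "increase" step.
    """
    d = cosine_distance(a, b)
    if d <= 0.5:
        return d * d
    return math.sqrt(d)
```

For non-negative TF-IDF vectors the cosine distance lies in [0, 1], so squaring makes a small distance smaller, while the assumed square root makes a distance in (0.5, 1) larger.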