• DocumentCode
    2766087
  • Title
    Document Clustering Using Concept Space and Cosine Similarity Measurement
  • Author
    Muflikhah, Lailil ; Baharudin, Baharum
  • Author_Institution
    Dept. of Comput. & Inf. Sci., Univ. Teknol. Petronas, Tronoh, Malaysia
  • Volume
    1
  • fYear
    2009
  • fDate
    13-15 Nov. 2009
  • Firstpage
    58
  • Lastpage
    62
  • Abstract
    Document clustering is a form of data clustering, which is a data mining task and a kind of unsupervised classification. It is often applied to very large data sets in order to partition them by similarity. It was initially used in information retrieval to improve the precision and recall of queries. Clustering is easy when the data have a small number of attributes that capture the important items. Furthermore, document clustering is very useful in information retrieval applications for reducing retrieval time and achieving high precision and recall. Therefore, we propose integrating an information retrieval method with document clustering as a concept space approach. The method is known as Latent Semantic Indexing (LSI), which uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). The aim of this method is to reduce the matrix dimension by finding patterns in the document collection based on the co-occurrence of terms. Each method is applied to the term-document weight matrix in the vector space model (VSM) for document clustering using the fuzzy c-means algorithm. Besides reducing the term-document matrix, this research also uses cosine similarity in place of Euclidean distance in fuzzy c-means. As a result, the proposed method performs better than the existing method, with an F-measure of around 0.91 and an entropy of around 0.51.
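    The pipeline outlined in the abstract (SVD-based dimension reduction of the term-document matrix, followed by fuzzy c-means with a cosine measure) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy matrix, the fuzzifier m = 2, and the choice of 1 - cosine similarity as the dissimilarity with length-normalized centroids are all assumptions made for illustration.

```python
import numpy as np

def lsi_reduce(term_doc, k):
    """Map documents into a k-dimensional concept space via truncated SVD.

    term_doc is a (terms x documents) weight matrix, e.g. TF-IDF weights.
    Returns one row per document (the columns of S_k V_k^T, transposed).
    """
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (s[:k, None] * vt[:k]).T

def normalize_rows(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def fuzzy_cmeans_cosine(docs, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Fuzzy c-means using 1 - cosine similarity as the dissimilarity.

    Assumption: centroids are weighted means re-normalized to unit length.
    Returns (memberships of shape (n_docs, c), centroids of shape (c, dim)).
    """
    rng = np.random.default_rng(seed)
    x = normalize_rows(docs)
    u = rng.random((x.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)          # each row of memberships sums to 1
    for _ in range(n_iter):
        w = u ** m                              # fuzzified membership weights
        centers = normalize_rows(w.T @ x)
        dist = np.maximum(1.0 - x @ centers.T, 1e-12)   # cosine dissimilarity
        inv = dist ** (-1.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(u_new - u).max() < tol
        u = u_new
        if converged:
            break
    return u, centers

# Hypothetical toy data: 4 terms x 6 documents, two disjoint vocabularies.
A = np.array([
    [2, 3, 1, 0, 0, 0],
    [1, 2, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 2],
    [0, 0, 0, 1, 2, 2],
], dtype=float)

Z = lsi_reduce(A, k=2)                   # documents in a 2-D concept space
U, centers = fuzzy_cmeans_cosine(Z, c=2)
labels = U.argmax(axis=1)                # hard labels from the fuzzy memberships
```

    On unit-length vectors, the squared Euclidean distance equals 2(1 - cosine similarity), so this membership update has the same form as the standard fuzzy c-means update applied to normalized document vectors.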
  • Keywords
    data mining; document handling; fuzzy set theory; matrix algebra; pattern classification; pattern clustering; principal component analysis; singular value decomposition; vectors; Euclidean distance; concept space; cosine similarity measurement; data attributes; data clustering concept; data mining tasks; document clustering; document collection; fuzzy c-means algorithm; information retrieval; latent semantic index approach; matrix dimension; principle component analysis; singular vector decomposition; term-document matrix; unsupervised classification; vector space model; Clustering algorithms; Clustering methods; Data mining; Euclidean distance; Extraterrestrial measurements; Information retrieval; Information science; Large scale integration; Principal component analysis; Space technology; LSI; cosine similarity; data mining; document clustering; fuzzy c-means;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Technology and Development, 2009. ICCTD '09. International Conference on
  • Conference_Location
    Kota Kinabalu
  • Print_ISBN
    978-0-7695-3892-1
  • Type
    conf
  • DOI
    10.1109/ICCTD.2009.206
  • Filename
    5359952