DocumentCode
2766087
Title
Document Clustering Using Concept Space and Cosine Similarity Measurement
Author
Muflikhah, Lailil ; Baharudin, Baharum
Author_Institution
Dept. of Comput. & Inf. Sci., Univ. Teknol. Petronas, Tronoh, Malaysia
Volume
1
fYear
2009
fDate
13-15 Nov. 2009
Firstpage
58
Lastpage
62
Abstract
Document clustering is a form of data clustering, which is a data mining task and a kind of unsupervised classification. It is often applied to very large data sets in order to partition them by similarity. It was first used in information retrieval to improve the precision and recall of queries. Clustering is easy when the data have a small number of attributes that capture the important items. Document clustering is also useful in information retrieval applications for reducing processing time while achieving high precision and recall. We therefore propose integrating an information retrieval method with document clustering in a concept space approach. The method is known as Latent Semantic Indexing (LSI), which uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). Its aim is to reduce the matrix dimension by finding patterns in the document collection based on the co-occurrence of terms. Each method is applied to the term-document weights in the vector space model (VSM), and the documents are then clustered using the fuzzy c-means algorithm. Besides reducing the term-document matrix, this research also uses cosine similarity in place of Euclidean distance within fuzzy c-means. As a result, the proposed method performs better than the existing method, with an f-measure of around 0.91 and an entropy of around 0.51.
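The pipeline summarized in the abstract (reduce the term-document matrix to a concept space, then run fuzzy c-means with cosine similarity instead of Euclidean distance) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `lsi_reduce` and `fuzzy_cmeans_cosine`, the fuzzifier `m = 2`, the use of `1 - cosine similarity` as the dissimilarity, the normalized weighted-mean center update, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def lsi_reduce(A, k):
    """Map the documents (columns of term-document matrix A) to k concept dimensions."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Document coordinates are the rows of (diag(s_k) @ Vt_k).T, shape (n_docs, k).
    return (s[:k, None] * Vt[:k]).T

def normalize_rows(X, eps=1e-12):
    """Scale every row to unit length so that a dot product equals cosine similarity."""
    return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), eps)

def fuzzy_cmeans_cosine(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means using 1 - cosine similarity as the dissimilarity (assumed variant)."""
    rng = np.random.default_rng(seed)
    X = normalize_rows(X)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # memberships of each document sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = normalize_rows(W.T @ X)        # weighted mean direction per cluster
        D = np.maximum(1.0 - X @ centers.T, 1e-12)
        inv = D ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers

# Toy term-document matrix (invented): documents 0-2 share terms 0-2,
# documents 3-5 share terms 3-5.
A = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 2],
    [0, 0, 0, 1, 3, 1],
    [0, 0, 0, 2, 1, 3],
], dtype=float)

docs_k = lsi_reduce(A, k=2)                      # documents in a 2-dimensional concept space
U, centers = fuzzy_cmeans_cosine(docs_k, c=2)
labels = U.argmax(axis=1)                        # hard labels taken from the fuzzy memberships
```

On this toy input the two groups of documents lie along orthogonal concept directions, so the largest membership of each document should fall in the cluster of its own group. Real collections would first need term weighting (for example TF-IDF) and a choice of `k` and `c`, neither of which is specified in the abstract.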
Keywords
data mining; document handling; fuzzy set theory; matrix algebra; pattern classification; pattern clustering; principal component analysis; singular value decomposition; vectors; Euclidean distance; concept space; cosine similarity measurement; data attributes; data clustering concept; data mining tasks; document clustering; document collection; fuzzy c-means algorithm; information retrieval; latent semantic index approach; matrix dimension; principle component analysis; singular vector decomposition; term-document matrix; unsupervised classification; vector space model; Clustering algorithms; Clustering methods; Data mining; Euclidean distance; Extraterrestrial measurements; Information retrieval; Information science; Large scale integration; Principal component analysis; Space technology; LSI; cosine similarity; data mining; document clustering; fuzzy c-means;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Technology and Development, 2009. ICCTD '09. International Conference on
Conference_Location
Kota Kinabalu
Print_ISBN
978-0-7695-3892-1
Type
conf
DOI
10.1109/ICCTD.2009.206
Filename
5359952