DocumentCode :
3275534
Title :
A Wordsets based document clustering algorithm for large datasets
Author :
Sharma, Anuj ; Dhir, Renu
Author_Institution :
Dept. of Comput. Sci. & Eng., Dr. B. R. Ambedkar Nat. Inst. of Technol., Jalandhar, India
fYear :
2009
fDate :
14-15 Dec. 2009
Firstpage :
1
Lastpage :
7
Abstract :
Document clustering is an important tool for applications such as search engines and document browsers. It gives the user a good overall view of the information contained in the documents. The well-known methods of document clustering, however, do not really address the special problems of text document clustering: the very high dimensionality of the documents, the very large size of the datasets, and the understandability of the cluster descriptions. There is also a strong need for hierarchical document clustering, in which clustered documents can be browsed according to the increasing specificity of topics. Frequent Itemset Hierarchical Clustering (FIHC) is a novel data mining algorithm for hierarchical grouping of text documents. However, this approach does not give reliable clustering results when the number of frequent sets of terms is large. In this paper we propose WDC (Wordsets-based Clustering), an efficient clustering algorithm based on closed wordsets. WDC uses a hierarchical approach to cluster text documents that have words in common. WDC is found to be scalable, effective and efficient when compared with existing clustering algorithms such as K-means and its variants.
Keywords :
data mining; pattern clustering; text analysis; Wordsets based document clustering algorithm; cluster description; data mining; document browsers; document information; frequent itemset hierarchical clustering; hierarchical document clustering; hierarchical text document grouping; search engines; topic specificity; Application software; Clustering algorithms; Computer science; Data engineering; Data mining; Databases; Itemsets; Partitioning algorithms; Robustness; Search engines; Clustering algorithm; Hierarchical document clustering; Wordsets based Clustering; document clustering;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Methods and Models in Computer Science, 2009. ICM2CS 2009. Proceedings of International Conference on
Conference_Location :
Delhi
Print_ISBN :
978-1-4244-5051-0
Type :
conf
DOI :
10.1109/ICM2CS.2009.5397962
Filename :
5397962