DocumentCode :
3106825
Title :
High Quality, Efficient Hierarchical Document Clustering Using Closed Interesting Itemsets
Author :
Malik, Hassan H. ; Kender, John R.
Author_Institution :
Dept. of Comput. Sci., Columbia Univ., New York, NY
fYear :
2006
fDate :
18-22 Dec. 2006
Firstpage :
991
Lastpage :
996
Abstract :
High dimensionality remains a significant challenge for document clustering. Recent approaches used frequent itemsets and closed frequent itemsets to reduce dimensionality and to improve the efficiency of hierarchical document clustering. In this paper, we introduce the notion of "closed interesting" itemsets (i.e., closed itemsets with high interestingness). We provide heuristics such as "super item" to efficiently mine these itemsets and show that they provide significant dimensionality reduction over closed frequent itemsets. Using "closed interesting" itemsets, we propose a new, sub-linearly scalable, hierarchical document clustering method that outperforms state-of-the-art agglomerative, partitioning and frequent-itemset-based methods in terms of both clustering quality and runtime performance, without requiring dataset-specific parameter tuning. We evaluate twenty interestingness measures and show that, when used to generate "closed interesting" itemsets and to select parent nodes, mutual information, added value, Yule's Q and Chi-Square offer the best clustering performance.
Keywords :
data mining; document handling; pattern clustering; added value; closed frequent itemsets; closed interesting itemsets; clustering quality; dimensionality reduction; hierarchical document clustering; interestingness measure; itemset mining; mutual information; parent nodes; runtime performance; Association rules; Clustering algorithms; Clustering methods; Computer science; Data mining; Frequency; Itemsets; Merging; Runtime; Scalability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2006. ICDM '06. Sixth International Conference on
Conference_Location :
Hong Kong
ISSN :
1550-4786
Print_ISBN :
0-7695-2701-7
Type :
conf
DOI :
10.1109/ICDM.2006.81
Filename :
4053141