DocumentCode :
866739
Title :
Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster Summarization
Author :
Hammouda, Khaled M. ; Kamel, Mohamed S.
Author_Institution :
Desire2Learn Inc., Kitchener, ON
Volume :
21
Issue :
5
fYear :
2009
fDate :
5/1/2009
Firstpage :
681
Lastpage :
698
Abstract :
In distributed data mining, adopting a flat node distribution model can affect scalability. To address the problem of modularity, flexibility and scalability, we propose a Hierarchically-distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm. The architecture is based on a multi-layer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher level neighborhoods. Within a certain level of the hierarchy, peers cooperate within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighborhoods, solve each part individually using a distributed K-means variant, then successively combine clusterings up the hierarchy where increasingly more global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing interpretation of the clusters. Results show decent speedup, reaching 165 times faster than centralized clustering for a 250-node simulated network, with comparable clustering quality to the centralized approach. We also provide comparison to the P2P K-means algorithm and show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized counterparts with up to 88% accuracy.
Keywords :
data mining; distributed processing; document handling; pattern clustering; peer-to-peer computing; distributed cluster summarization; distributed data mining; distributed document clusters; distributed k-means variant; distributed keyphrase extraction algorithm; flat node distribution; hierarchically distributed peer-to-peer document clustering; higher level neighborhoods; multilayer overlay network; Abstracting methods; Clustering; Data mining; Distributed systems; Text mining;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2008.189
Filename :
4626955