DocumentCode :
3579180
Title :
A Novel Clustering Algorithm on Large-Scale Graph Data
Author :
Hao Zhang ; Wei Zhou ; Xiaoyu Wan ; Ge Fu ; Zhiyong Xu ; Jizhong Han
Author_Institution :
Sch. of Commun. & Inf. Eng., Chongqing Univ. of Posts & Telecommun., Chongqing, China
fYear :
2014
Firstpage :
47
Lastpage :
54
Abstract :
Large scale Social Network Service (SNS) applications use graph as the default data organization model. To discover hidden useful information, data clustering techniques are widely adopted in data analysis tasks. State-of-the-art clustering algorithms are based on distributed programming paradigms such as MapReduce and Bulk Synchronization Parallel (BSP) models. These solutions suffer from heavy data transmission cost as well as excessive storage overhead problems. Thus, they can hardly achieve the optimal performance. In this paper, we propose a novel clustering algorithm based on pagerank to relieve these issues. Compare to current systems, our algorithm can achieve similar clustering results on large-scale graph data with much lower network and storage overhead. It consists of three steps. First, it calculates the pagerank values for each vertex based on the underlying data set. Then, we select multiple vertexes from the list as clustering centers. Third, our algorithm expands the range of the cluster by adding more neighboring vertexes iteratively. We compare our algorithm with other popular clustering algorithms using real world data sets and the results show that it achieves better performance than any other distributed solutions. For example, on a web graph data set with hundreds of thousands of vertexes and millions of edges, the time and memory consumption can be reduced by more than 95% and 70%, respectively.
Keywords :
Internet; data analysis; distributed programming; graph theory; parallel processing; pattern clustering; social networking (online); storage management; BSP model; MapReduce; SNS application; Web graph data set; bulk synchronization parallel model; clustering center; data analysis task; data clustering technique; default data organization model; distributed programming paradigm; distributed solution; heavy data transmission cost; large-scale graph data; memory consumption; optimal performance; pagerank; popular clustering algorithm; real world data set; social network service application; storage overhead problem; Algorithm design and analysis; Attenuation; Clustering algorithms; Computational modeling; Data analysis; Distributed databases;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Cloud Computing and Big Data (CCBD), 2014 International Conference on
Type :
conf
DOI :
10.1109/CCBD.2014.23
Filename :
7062871
Link To Document :
بازگشت